[DBZ] Read offset from Kafka for every commit callback #348

Merged
22 commits merged on Aug 30, 2024

Conversation


@vaibhav-yb vaibhav-yb commented Aug 14, 2024

Note: This PR breaks explicit checkpointing in the tablet splitting case; this will be fixed in a follow-up PR.

Problem

There are two issues with the current callback-based checkpointing mechanism. Both are potential causes of data loss, and both have been reproduced manually:

DBZ-6026 -

The current code uses the method BaseSourceTask#logStatistics to log some information; however, that method also updates the offset map that is later used for the commit callback.

As a result, if a commit() callback fires after the statistics are logged but before the records are returned from BaseSourceTask#poll, the checkpoint is marked on the service for records that were never handed to Kafka Connect. If the connector restarts inside that window, those records are lost.

Steps to reproduce:

  1. Put a sleep of 2 minutes after logStatistics() is called and before the records are returned from BaseSourceTask#poll (see the sketch after this list).
    a. Add a log line so that we can see when the method is called.
  2. Create a table with a single tablet.
  3. Insert a record and wait for the log from 1a to appear; this indicates that the records have not been returned yet.
  4. Wait for the commit callback - this generally arrives before the sleep ends.
  5. Restart the connector.
  6. Upon restart, the connector starts from the checkpoint taken in step 4, so the record inserted in step 3 is never streamed.
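
For reference, a minimal, self-contained sketch of the race being reproduced. This is a hypothetical model of the ordering described above (publish the latest offset while logging statistics, then return the batch), not the actual Debezium BaseSourceTask code; the class and field names are illustrative only.

import java.util.List;
import java.util.concurrent.TimeUnit;

class PollRaceSketch {
    // offset map made visible to the commit() callback (updated as part of logging statistics)
    private volatile Object lastOffsetForCommit;

    List<String> poll(List<String> records, Object latestOffset) throws InterruptedException {
        // logStatistics(...) equivalent: the latest offset becomes visible to commit() here
        lastOffsetForCommit = latestOffset;

        // Injected only to reproduce DBZ-6026: a commit() callback that runs in this window
        // checkpoints 'latestOffset' even though 'records' were never returned to Kafka Connect.
        TimeUnit.MINUTES.sleep(2);

        return records;
    }

    void commit() {
        // Marks 'lastOffsetForCommit' on the service; if the connector restarts after this point
        // but before the records above were delivered, those records are lost.
    }
}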

DBZ-7816 -

According to the Kafka docs, callbacks for the same Kafka partition are guaranteed to arrive in order, but callbacks for different Kafka partitions can arrive out of order, which opens the potential data-loss window described in DBZ-7816.

Steps to reproduce:

  1. Create a table with 20 tablets and fill it with 100k records.
  2. Create a topic with 50 partitions.
  3. Start a snapshot on the table with snapshot.mode=initial.
  4. Once the snapshot is finished, the tablet is added to the wait list; if the callbacks arrive out of order, we may never receive a callback for the last snapshot record and therefore never transition from snapshot to streaming (a simplified sketch of this wait-list behaviour follows below).

Note that across all the experiments performed for this issue we were able to reproduce it every time, though it may not reproduce if the callbacks happen to arrive in order (perhaps 1 run out of 10).
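
A heavily simplified, hypothetical model of the wait-list behaviour from step 4, only to illustrate why a missing callback stalls the transition; all names (SnapshotWaitListSketch, snapshotFinished, onCommitCallback) are illustrative and the real connector logic is more involved.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class SnapshotWaitListSketch {
    // tablet id -> offset of the last snapshot record that must be acknowledged
    private final Map<String, Long> waitList = new ConcurrentHashMap<>();

    void snapshotFinished(String tabletId, long lastSnapshotOffset) {
        waitList.put(tabletId, lastSnapshotOffset);
    }

    // Invoked from the commit callback. If callbacks for different Kafka partitions arrive
    // out of order, the callback carrying exactly the awaited offset may never be observed,
    // so the tablet stays in the wait list and never transitions to streaming (DBZ-7816).
    void onCommitCallback(String tabletId, long committedOffset) {
        Long awaited = waitList.get(tabletId);
        if (awaited != null && committedOffset == awaited) {
            waitList.remove(tabletId);
            transitionToStreaming(tabletId);
        }
    }

    private void transitionToStreaming(String tabletId) {
        // switch the tablet from snapshot consumption to change-event streaming
    }
}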

Solution

This PR overrides the commit() method in YugabyteDBConnectorTask to read the offsets committed to Kafka for each source partition and forwards those same offsets through the commit callbacks, so that they can be marked on the service for checkpointing.
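
A minimal sketch of this approach, not the PR's actual code. It uses the standard Kafka Connect SourceTask#commit() hook and OffsetStorageReader; the helpers sourcePartitionsForAllTablets() and markCheckpointOnService() are hypothetical names used only for illustration.

import java.util.Collection;
import java.util.Map;
import org.apache.kafka.connect.source.SourceTask;

public abstract class OffsetReadingTask extends SourceTask {

    @Override
    public void commit() throws InterruptedException {
        // Instead of trusting offsets captured at poll() time, read back what Kafka Connect
        // has actually committed for every source partition (tablet) this task owns.
        for (Map<String, Object> partition : sourcePartitionsForAllTablets()) {      // hypothetical helper
            Map<String, Object> committedOffset = context.offsetStorageReader().offset(partition);
            if (committedOffset != null) {
                // Forward the committed offset to the checkpointing path so it is marked on the service.
                markCheckpointOnService(partition, committedOffset);                  // hypothetical helper
            }
        }
    }

    protected abstract Collection<Map<String, Object>> sourcePartitionsForAllTablets();

    protected abstract void markCheckpointOnService(Map<String, Object> partition, Map<String, Object> offset);
}

Because the offsets are read back from what Kafka Connect has actually persisted, the checkpoint sent to the service can no longer run ahead of the delivered records, which closes both data-loss windows described above.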

@vaibhav-yb vaibhav-yb added the bug Something isn't working label Aug 14, 2024
@vaibhav-yb vaibhav-yb self-assigned this Aug 14, 2024
@@ -940,7 +940,7 @@ public void commitOffset(Map<String, ?> offset) {
         // than one already present would throw the error: CDCSDK: Trying to fetch already GCed intents
         if (this.tabletToExplicitCheckpoint.get(entry.getKey()) != null &&
                 tempOpId.getIndex() < this.tabletToExplicitCheckpoint.get(entry.getKey()).getIndex()) {
-            LOGGER.warn("The received OpId {} is less than the older checkpoint {} for tablet {}",
+            LOGGER.debug("The received OpId {} is less than the older checkpoint {} for tablet {}",
vaibhav-yb (Collaborator, Author) commented:

This is converted to debug level considering the following scenario:

  1. The last record published by the connector has OpId 1.5.
  2. The connector received a callback for that OpId, so the explicit checkpoint is also at 1.5.
  3. There were then a couple of NO_OPs on the service, so the connector received empty batches and the explicit checkpoint was advanced to 1.8.
  4. Assume that for some time there are no records, or that there are no records after the record in step 1.
  5. This results in a commit callback with checkpoint 1.5, since that was the last published record.
  6. Subsequently, we would always end up printing this warning log, which could be confusing to users.
