Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LOG_BASED replication produces duplicate rows #141

Open
skjbulcher opened this issue Dec 23, 2020 · 2 comments
Open

LOG_BASED replication produces duplicate rows #141

skjbulcher opened this issue Dec 23, 2020 · 2 comments

Comments

@skjbulcher
Copy link

I'm finding that if the tap runs after multiple updates to a single row while using LOG_BASED replication, duplicate records are being generated. What follows is an slightly modified example that I ran into in production.

I have the following lines in my binlog:

/*!100001 SET @@session.gtid_seq_no=45555058*//*!*/;
BEGIN
/*!*/;
# at 30263912
# at 30264365
#201223  9:16:23 server id 1  end_log_pos 30264365 CRC32 0x99c7ce0c     Annotate_rows:
#Q> insert into `contracts`.`contracts` (`autorenewal`, `autorenewal_cancelled_at`, `autorenewal_end_date`, `contract_signed_at`, `exclusive`, `guid`, `id`, `iso`, `other_charges`, `other_fee`, `payment_terms_id`, `price`, `production_fee`, `production_number`, `sales_period`, `service_fee`, `status`) values ('yes', NULL, NULL, '2020-12-23 17:13:56', false, 43124, NULL, 'TSD', 0, 0, 4, 2689, 0, 'f1kOuqCqjH', '202103', 0, 'open')
#201223  9:16:23 server id 1  end_log_pos 30264465 CRC32 0x72aa7fe7     Table_map: `contracts`.`contracts` mapped to number 7644
# at 30264465
#201223  9:16:23 server id 1  end_log_pos 30264573 CRC32 0xa4a70f8f     Write_rows: table id 7644 flags: STMT_END_F

BINLOG '
...
'/*!*/;
# Number of rows: 1
# at 30264573
# at 30264688
#201223  9:16:23 server id 1  end_log_pos 30264688 CRC32 0x878b81b4     Annotate_rows:
#Q> update `contracts`.`contracts` set `production_number` = '[ACTUAL VALUE]' where `id` = 1768
#201223  9:16:23 server id 1  end_log_pos 30264788 CRC32 0xe0cf2065     Table_map: `contracts`.`contracts` mapped to number 7644
# at 30264788
#201223  9:16:23 server id 1  end_log_pos 30264976 CRC32 0x21fc0f4c     Update_rows: table id 7644 flags: STMT_END_F

BINLOG '
...
'/*!*/;

In the first query, a row is inserted into the table containing a placeholder production_number. In the second query, the production_number is updated to the actual value.

When I look at the tap-mysql output (and what is uploaded via target-bigquery), I see that there are two records distinguishable only in sequence:

{"type": "RECORD", "stream": "contracts", "record": {"id": 1768, "guid": 43124, "status": "open", "iso": "MNM", "sales_period": "202103", "production_number": "f1kOuqCqjH", "exclusive": false, "autorenewal": "yes", "autorenewal_end_date": null, "price": 2689.00, "production_fee": 0.00, "other_fee": 0.00, "service_fee": 0.00, "other_charges": 0.00, "payment_terms_id": 4, "contract_signed_at": "2020-12-23T08:00:00.000000Z", "autorenewal_cancelled_at": null, "entered_at": "2020-12-23T17:16:23.000000Z", "_sdc_deleted_at": null}, "version": 1607545454071, "time_extracted": "2020-12-23T17:20:20.966775Z"}
{"type": "RECORD", "stream": "contracts", "record": {"id": 1768, "guid": 43124, "status": "open", "iso": "MNM", "sales_period": "202103", "production_number": "[ACTUAL VALUE]", "exclusive": false, "autorenewal": "yes", "autorenewal_end_date": null, "price": 2689.00, "production_fee": 0.00, "other_fee": 0.00, "service_fee": 0.00, "other_charges": 0.00, "payment_terms_id": 4, "contract_signed_at": "2020-12-23T08:00:00.000000Z", "autorenewal_cancelled_at": null, "entered_at": "2020-12-23T17:16:23.000000Z", "_sdc_deleted_at": null}, "version": 1607545454071, "time_extracted": "2020-12-23T17:20:20.966775Z"}

As you can see from the binlog, the two records are distinct and should be output. However, there is no way to dedupe the rows once they have been uploaded to another SQL service like Google BigQuery - all metadata (version and time_extracted) is identical in both rows.

@skjbulcher
Copy link
Author

I think the fix is to add additional metadata indicating order. Stitch does this with the _sdc_sequence column, so I think that is a good place to start.

@skjbulcher
Copy link
Author

OTOH, I may be misunderstanding the intent of _sdc_sequence. Is that better handled with the tap or the target?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant