I'm finding that if the tap runs after multiple updates to a single row while using LOG_BASED replication, duplicate records are generated. What follows is a slightly modified example that I ran into in production.
I have the following lines in my binlog:
/*!100001 SET @@session.gtid_seq_no=45555058*//*!*/;
BEGIN
/*!*/;
# at 30263912
# at 30264365
#201223 9:16:23 server id 1 end_log_pos 30264365 CRC32 0x99c7ce0c Annotate_rows:
#Q> insert into `contracts`.`contracts` (`autorenewal`, `autorenewal_cancelled_at`, `autorenewal_end_date`, `contract_signed_at`, `exclusive`, `guid`, `id`, `iso`, `other_charges`, `other_fee`, `payment_terms_id`, `price`, `production_fee`, `production_number`, `sales_period`, `service_fee`, `status`) values ('yes', NULL, NULL, '2020-12-23 17:13:56', false, 43124, NULL, 'TSD', 0, 0, 4, 2689, 0, 'f1kOuqCqjH', '202103', 0, 'open')
#201223 9:16:23 server id 1 end_log_pos 30264465 CRC32 0x72aa7fe7 Table_map: `contracts`.`contracts` mapped to number 7644
# at 30264465
#201223 9:16:23 server id 1 end_log_pos 30264573 CRC32 0xa4a70f8f Write_rows: table id 7644 flags: STMT_END_F
BINLOG '
...
'/*!*/;
# Number of rows: 1
# at 30264573
# at 30264688
#201223 9:16:23 server id 1 end_log_pos 30264688 CRC32 0x878b81b4 Annotate_rows:
#Q> update `contracts`.`contracts` set `production_number` = '[ACTUAL VALUE]' where `id` = 1768
#201223 9:16:23 server id 1 end_log_pos 30264788 CRC32 0xe0cf2065 Table_map: `contracts`.`contracts` mapped to number 7644
# at 30264788
#201223 9:16:23 server id 1 end_log_pos 30264976 CRC32 0x21fc0f4c Update_rows: table id 7644 flags: STMT_END_F
BINLOG '
...
'/*!*/;
In the first query, a row containing a placeholder `production_number` is inserted into the table. In the second query, `production_number` is updated to the actual value.
When I look at the tap-mysql output (and what is uploaded via target-bigquery), I see two records that are distinguishable only by the order in which they appear:
As you can see from the binlog, the two records are distinct and should both be output. However, there is no way to dedupe the rows once they have been uploaded to another SQL service such as Google BigQuery: all of the metadata (`version` and `time_extracted`) is identical in both rows.
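To make the problem concrete, here is a minimal sketch of a "keep the newest row per primary key" dedup pass over the two records. The literal `version` and `time_extracted` values are hypothetical stand-ins for what tap-mysql actually emits; the point is that they give the warehouse no tie-breaker:

```python
# Hypothetical records mirroring the binlog above: the insert with the
# placeholder production_number, then the update with the actual value.
records = [
    {"id": 1768, "production_number": "f1kOuqCqjH",
     "version": 1608715148000, "time_extracted": "2020-12-23T09:20:00.000000Z"},
    {"id": 1768, "production_number": "[ACTUAL VALUE]",
     "version": 1608715148000, "time_extracted": "2020-12-23T09:20:00.000000Z"},
]

latest = {}
for rec in records:
    prev = latest.get(rec["id"])
    # "Keep the newest record per primary key" needs an ordering column, but
    # version and time_extracted are identical on both rows, so the winner
    # depends purely on the order the rows happen to be read in.
    if prev is None or (rec["version"], rec["time_extracted"]) >= (prev["version"], prev["time_extracted"]):
        latest[rec["id"]] = rec

print(latest[1768]["production_number"])  # arbitrary: placeholder or actual value
```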
I think the fix is to add additional metadata indicating order. Stitch does this with the _sdc_sequence column, so I think that is a good place to start.
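As a rough sketch of what that could look like, the helper below attaches a Stitch-style `_sdc_sequence` value to each record as it is processed. The helper name and the exact point where the sequence would be generated (tap vs. target) are assumptions, not how tap-mysql currently works:

```python
import time
from typing import Iterable, Iterator

def with_sequence(records: Iterable[dict]) -> Iterator[dict]:
    """Attach a monotonically increasing _sdc_sequence to each record.

    Sketch only: the column name mirrors Stitch's _sdc_sequence; whether the
    value is produced by the tap or the target is an open question here.
    """
    base = int(time.time() * 1000)  # milliseconds, like Stitch's epoch-based value
    for offset, record in enumerate(records):
        # base + offset stays strictly increasing even when many binlog events
        # are read within the same millisecond, so a warehouse can order rows
        # by _sdc_sequence to pick the latest version of each primary key.
        yield {**record, "_sdc_sequence": base + offset}
```

With a column like that in place, BigQuery could dedupe by taking the row with the highest `_sdc_sequence` per primary key instead of relying on `version`/`time_extracted`.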