[Test Failure] Engine crash during TLS test with dual channel replication #1152

madolson · 2024-10-11T15:55:25Z

https://github.com/valkey-io/valkey/actions/runs/11283922852/job/31417233387#step:6:6008

It seems to occur on the replica side.

43813 valkey-server *
/lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7fd198042520]
/lib/x86_64-linux-gnu/libssl.so.3(+0x32aca)[0x7fd1989dfaca]
/lib/x86_64-linux-gnu/libssl.so.3(SSL_read+0x27)[0x7fd1989dfcb7]
src/valkey-server 127.0.0.1:23382(+0x1fa987)[0x5566f92b0987]
src/valkey-server 127.0.0.1:23382(+0x1705f8)[0x5566f92265f8]
src/valkey-server 127.0.0.1:23382(rdbLoadRioWithLoadingCtx+0x2f5)[0x5566f91b54f5]
src/valkey-server 127.0.0.1:23382(readSyncBulkPayload+0x315)[0x5566f91a6075]
src/valkey-server 127.0.0.1:23382(+0x1f310d)[0x5566f92a910d]
src/valkey-server 127.0.0.1:23382(+0x1fb9a3)[0x5566f92b19a3]
src/valkey-server 127.0.0.1:23382(+0x1fbc34)[0x5566f92b1c34]
src/valkey-server 127.0.0.1:23382(beforeSleep+0x87)[0x5566f9152d17]
src/valkey-server 127.0.0.1:23382(aeMain+0x3e)[0x5566f914a73e]
src/valkey-server 127.0.0.1:23382(main+0x52f)[0x5566f913f7af]
/lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7fd198029d90]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7fd198029e40]
src/valkey-server 127.0.0.1:23382(_start+0x25)[0x5566f913fea5]

The text was updated successfully, but these errors were encountered:

ranshid · 2024-10-14T05:47:02Z

I am still working on understanding the crash. Seems that maybe we tried reading from a deleted connection? (I cannot understand how possible from the code though, since as far as I can tell the code path nullifying the connections and establish new connection on retries.. )

There is also another fail in the valgrind test, which seems related to the fact that the replica <-> primary is timing out. maybe just increase the repl-timeout in case of valgrind run identification.

ranshid · 2024-10-14T17:17:27Z

So we were able to track down the reason for the crash:

during Dual channel load the replica maintains 2 connections with the primary - repl_transfer_s, in order to read the psync data from the primary and repl_rdb_transfer_s for the rdb data transfer. when the replica is loading the data to via socket, it is in a tight loop inside rdbLoadRioWithLoadingCtx. in Case the network with the primary is lost it will usually be identified on repl_rdb_transfer_s connection while reading the data from the socket in rioRead.
However in some cases it is possible that the disconnection is identified on the repl_transfer_s first during the loading phase. This can happen while we free the CPU to process some Epoll events in rdbLoadProgressCallback. In this case we identified the repl_transfer_s was disconnected during the rdbLoadProgressCallback and thus we cleanup all Replication connections and state in cancelReplicationHandshake which also closes and cleanup the repl_rdb_transfer_s in replicationAbortDualChannelSyncTransfer. this, however will then return to the rdbLoadRioWithLoadingCtx loop and attempt another rioRead on the rio connection which was never nullified or indicated as freed.

One possible fix is to add a flag to rio which indicates it should abort ASAP (which means on the next io operation)

…onnection handling Introduces a dedicated flag in provisional primary struct to signal immediate abort, preventing potential use-after-free scenarios during replication disconnection in dual-channel load. This ensures proper termination of rdbLoadRioWithLoadingCtx when replication is cancelled due to connection loss on main connection. Fixes valkey-io#1152 Signed-off-by: naglera <[email protected]>

in case of valgrind run, the replica might get disconnected from the primary due to repl-timeout reached. Fix is to configure larger timeout in case of valgrind test. **Partially** fixes: valkey-io#1152 Signed-off-by: Ran Shidlansik <[email protected]>

madolson added the test-failure An issue indicating a test failure label Oct 11, 2024

madolson assigned ranshid Oct 11, 2024

ranshid mentioned this issue Oct 14, 2024

Deflake test Primary COB growth with inactive replica #1165

Merged

enjoy-binbin closed this as completed in #1165 Oct 14, 2024

enjoy-binbin closed this as completed in 597aa03 Oct 14, 2024

enjoy-binbin reopened this Oct 14, 2024

ranshid assigned naglera Oct 14, 2024

ranshid added the bug Something isn't working label Oct 14, 2024

naglera linked a pull request Oct 15, 2024 that will close this issue

Add ASAP abort flag to provisional primary for safer replication disconnection handling #1173

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Test Failure] Engine crash during TLS test with dual channel replication #1152

[Test Failure] Engine crash during TLS test with dual channel replication #1152

madolson commented Oct 11, 2024

ranshid commented Oct 14, 2024 •

edited

Loading

ranshid commented Oct 14, 2024

[Test Failure] Engine crash during TLS test with dual channel replication #1152

[Test Failure] Engine crash during TLS test with dual channel replication #1152

Comments

madolson commented Oct 11, 2024

ranshid commented Oct 14, 2024 • edited Loading

ranshid commented Oct 14, 2024

ranshid commented Oct 14, 2024 •

edited

Loading