Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Test Failure] Engine crash during TLS test with dual channel replication #1152

Open
madolson opened this issue Oct 11, 2024 · 2 comments · Fixed by #1165 · May be fixed by #1173
Open

[Test Failure] Engine crash during TLS test with dual channel replication #1152

madolson opened this issue Oct 11, 2024 · 2 comments · Fixed by #1165 · May be fixed by #1173
Assignees
Labels
bug Something isn't working test-failure An issue indicating a test failure

Comments

@madolson
Copy link
Member

https://github.com/valkey-io/valkey/actions/runs/11283922852/job/31417233387#step:6:6008

It seems to occur on the replica side.

43813 valkey-server *
/lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7fd198042520]
/lib/x86_64-linux-gnu/libssl.so.3(+0x32aca)[0x7fd1989dfaca]
/lib/x86_64-linux-gnu/libssl.so.3(SSL_read+0x27)[0x7fd1989dfcb7]
src/valkey-server 127.0.0.1:23382(+0x1fa987)[0x5566f92b0987]
src/valkey-server 127.0.0.1:23382(+0x1705f8)[0x5566f92265f8]
src/valkey-server 127.0.0.1:23382(rdbLoadRioWithLoadingCtx+0x2f5)[0x5566f91b54f5]
src/valkey-server 127.0.0.1:23382(readSyncBulkPayload+0x315)[0x5566f91a6075]
src/valkey-server 127.0.0.1:23382(+0x1f310d)[0x5566f92a910d]
src/valkey-server 127.0.0.1:23382(+0x1fb9a3)[0x5566f92b19a3]
src/valkey-server 127.0.0.1:23382(+0x1fbc34)[0x5566f92b1c34]
src/valkey-server 127.0.0.1:23382(beforeSleep+0x87)[0x5566f9152d17]
src/valkey-server 127.0.0.1:23382(aeMain+0x3e)[0x5566f914a73e]
src/valkey-server 127.0.0.1:23382(main+0x52f)[0x5566f913f7af]
/lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7fd198029d90]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7fd198029e40]
src/valkey-server 127.0.0.1:23382(_start+0x25)[0x5566f913fea5]
@madolson madolson added the test-failure An issue indicating a test failure label Oct 11, 2024
@ranshid
Copy link
Member

ranshid commented Oct 14, 2024

I am still working on understanding the crash. Seems that maybe we tried reading from a deleted connection? (I cannot understand how possible from the code though, since as far as I can tell the code path nullifying the connections and establish new connection on retries.. )

There is also another fail in the valgrind test, which seems related to the fact that the replica <-> primary is timing out. maybe just increase the repl-timeout in case of valgrind run identification.

@ranshid
Copy link
Member

ranshid commented Oct 14, 2024

So we were able to track down the reason for the crash:

during Dual channel load the replica maintains 2 connections with the primary - repl_transfer_s, in order to read the psync data from the primary and repl_rdb_transfer_s for the rdb data transfer. when the replica is loading the data to via socket, it is in a tight loop inside rdbLoadRioWithLoadingCtx. in Case the network with the primary is lost it will usually be identified on repl_rdb_transfer_s connection while reading the data from the socket in rioRead.
However in some cases it is possible that the disconnection is identified on the repl_transfer_s first during the loading phase. This can happen while we free the CPU to process some Epoll events in rdbLoadProgressCallback. In this case we identified the repl_transfer_s was disconnected during the rdbLoadProgressCallback and thus we cleanup all Replication connections and state in cancelReplicationHandshake which also closes and cleanup the repl_rdb_transfer_s in replicationAbortDualChannelSyncTransfer. this, however will then return to the rdbLoadRioWithLoadingCtx loop and attempt another rioRead on the rio connection which was never nullified or indicated as freed.

One possible fix is to add a flag to rio which indicates it should abort ASAP (which means on the next io operation)

@ranshid ranshid added the bug Something isn't working label Oct 14, 2024
naglera added a commit to naglera/placeholderkv that referenced this issue Oct 15, 2024
…onnection handling

Introduces a dedicated flag in provisional primary struct to signal immediate
abort, preventing potential use-after-free scenarios during replication
disconnection in dual-channel load. This ensures proper termination of
rdbLoadRioWithLoadingCtx when replication is cancelled due to connection loss
on main connection.

Fixes valkey-io#1152

Signed-off-by: naglera <[email protected]>
naglera added a commit to naglera/placeholderkv that referenced this issue Oct 15, 2024
…onnection handling

Introduces a dedicated flag in provisional primary struct to signal immediate
abort, preventing potential use-after-free scenarios during replication
disconnection in dual-channel load. This ensures proper termination of
rdbLoadRioWithLoadingCtx when replication is cancelled due to connection loss
on main connection.

Fixes valkey-io#1152

Signed-off-by: naglera <[email protected]>
naglera added a commit to naglera/placeholderkv that referenced this issue Oct 16, 2024
…onnection handling

Introduces a dedicated flag in provisional primary struct to signal immediate
abort, preventing potential use-after-free scenarios during replication
disconnection in dual-channel load. This ensures proper termination of
rdbLoadRioWithLoadingCtx when replication is cancelled due to connection loss
on main connection.

Fixes valkey-io#1152

Signed-off-by: naglera <[email protected]>
eifrah-aws pushed a commit to eifrah-aws/valkey that referenced this issue Oct 20, 2024
in case of valgrind run, the replica might get disconnected from the
primary due to repl-timeout reached. Fix is to configure larger timeout
in case of valgrind test.

**Partially** fixes: valkey-io#1152

Signed-off-by: Ran Shidlansik <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working test-failure An issue indicating a test failure
Projects
None yet
4 participants