[Test Failure] Engine crash during TLS test with dual channel replication #1152
I am still working on understanding the crash. It seems that we may have tried reading from a deleted connection. (I cannot see how that is possible from the code, though, since as far as I can tell the code path nullifies the connections and establishes a new connection on retries.) There is also another failure in the valgrind test, which seems related to the replica <-> primary connection timing out. Maybe we should just increase repl-timeout when a valgrind run is detected.
So we were able to track down the reason for the crash: during dual-channel load the replica maintains two connections with the primary: repl_transfer_s, to read the psync data from the primary, and repl_rdb_transfer_s, for the RDB data transfer. While the replica is loading the data via the socket, it is in a tight loop inside rdbLoadRioWithLoadingCtx. If the network connection to the primary is lost, this is usually detected on the repl_rdb_transfer_s connection while reading data from the socket in rioRead. One possible fix is to add a flag to rio that indicates it should abort ASAP (that is, on the next I/O operation).
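To make the "abort on the next I/O operation" idea concrete, here is a minimal sketch. It does not use the real rio API: `sketch_reader`, `SKETCH_FLAG_ABORT`, and every function below are hypothetical names invented for illustration, and the reader works on an in-memory buffer instead of a socket. The only point is that a flag set from outside the load loop makes the very next read fail, so a tight loading loop stops before it touches state that may already have been freed.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical flag bit for this sketch; not taken from the real rio.h. */
#define SKETCH_FLAG_ABORT (1u << 0)

/* Minimal stand-in for a rio-like reader over an in-memory buffer. */
typedef struct sketch_reader {
    const unsigned char *buf; /* source data */
    size_t len;               /* total bytes available */
    size_t pos;               /* bytes consumed so far */
    unsigned flags;           /* SKETCH_FLAG_* bits */
} sketch_reader;

static void sketch_reader_init(sketch_reader *r, const void *buf, size_t len) {
    r->buf = buf;
    r->len = len;
    r->pos = 0;
    r->flags = 0;
}

/* Request an abort, e.g. from the error path that notices the other
 * connection was lost. Nothing is torn down here; the flag is only set. */
static void sketch_reader_abort(sketch_reader *r) {
    r->flags |= SKETCH_FLAG_ABORT;
}

/* Read n bytes. Returns 1 on success, 0 on failure.
 * The abort flag is checked first, so the read fails before any data
 * is consumed once an abort has been requested. */
static int sketch_read(sketch_reader *r, void *out, size_t n) {
    if (r->flags & SKETCH_FLAG_ABORT) return 0;
    if (n > r->len - r->pos) return 0; /* not enough data left */
    memcpy(out, r->buf + r->pos, n);
    r->pos += n;
    return 1;
}

/* A tight load loop: read up to max_records fixed-size records and
 * return how many were loaded. It stops at the first failed read,
 * which includes a read rejected because of the abort flag. */
static size_t sketch_load_records(sketch_reader *r, size_t record_size,
                                  size_t max_records) {
    unsigned char record[64];
    size_t loaded = 0;
    if (record_size == 0 || record_size > sizeof(record)) return 0;
    while (loaded < max_records && sketch_read(r, record, record_size)) {
        loaded++; /* a real loader would parse and apply the record here */
    }
    return loaded;
}
```

In this sketch, calling `sketch_reader_abort` between two calls to `sketch_load_records` makes the second call return 0 without advancing `pos`. The design choice is that the party noticing the failure only marks the reader, and the loop itself unwinds on its next read, rather than having connection state freed underneath a loop that is still running.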
…onnection handling Introduces a dedicated flag in provisional primary struct to signal immediate abort, preventing potential use-after-free scenarios during replication disconnection in dual-channel load. This ensures proper termination of rdbLoadRioWithLoadingCtx when replication is cancelled due to connection loss on main connection. Fixes valkey-io#1152 Signed-off-by: naglera <[email protected]>
In a valgrind run, the replica might get disconnected from the primary because repl-timeout is reached. The fix is to configure a larger timeout for valgrind tests. **Partially** fixes: valkey-io#1152 Signed-off-by: Ran Shidlansik <[email protected]>
https://github.com/valkey-io/valkey/actions/runs/11283922852/job/31417233387#step:6:6008
It seems to occur on the replica side.