Mark the node as FAIL when the node is marked as NOADDR #1191

Open
wants to merge 3 commits into base: unstable

Conversation

enjoy-binbin
Member

Imagine we have a cluster, for example a three-shard cluster.
If shard 1 does a CLUSTER RESET HARD, it will change its node
name, and the other nodes will then mark it as NOADDR, since
the node name received by PONG has changed.

In the eyes of the other nodes, there is one working primary
left but with no address. In this case the address reported in
MOVED will be invalid and will confuse clients. At the same
time, the replica will not fail over, since its primary is not
in the FAIL state, and the cluster looks OK to everyone.

This leaves a cluster that appears OK but has no coverage for
shard 1; obviously we should do something like CLUSTER FORGET
to remove the node and fix the cluster before using it.

But the point here is that we can mark the NOADDR node as FAIL
to advance the cluster state. If a node is NOADDR, it does not
have a valid address, so we won't reconnect to it, we won't
send it PING, and we won't gossip it, so it seems reasonable to
mark it as FAIL.
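
A minimal sketch of the idea, assuming the flag and helper names used in
the existing cluster code (CLUSTER_NODE_NOADDR, CLUSTER_NODE_FAIL,
clusterSendFail, clusterDoBeforeSleep); the wrapper function name and the
exact call site are hypothetical and only for illustration, not the actual
patch in this PR.

```c
#include "server.h"
#include "cluster_legacy.h"

/* Hypothetical helper: once a node has been flagged NOADDR, also flag it
 * FAIL so the cluster state can advance (the replica can start a failover
 * and clients stop being redirected to an address-less primary). */
static void markNoaddrNodeAsFailed(clusterNode *node) {
    if (!(node->flags & CLUSTER_NODE_NOADDR)) return;
    if (node->flags & CLUSTER_NODE_FAIL) return; /* Already marked. */

    node->flags &= ~CLUSTER_NODE_PFAIL; /* FAIL supersedes PFAIL. */
    node->flags |= CLUSTER_NODE_FAIL;
    node->fail_time = mstime();

    /* Broadcast the FAIL so the other nodes converge quickly, then
     * schedule a cluster state update and config save. */
    clusterSendFail(node->name);
    clusterDoBeforeSleep(CLUSTER_TODO_UPDATE_STATE | CLUSTER_TODO_SAVE_CONFIG);
}
```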

Signed-off-by: Binbin <[email protected]>

codecov bot commented Oct 18, 2024

Codecov Report

Attention: Patch coverage is 66.66667% with 1 line in your changes missing coverage. Please review.

Project coverage is 70.70%. Comparing base (a62d1f1) to head (4d2780a).
Report is 2 commits behind head on unstable.

Files with missing lines    Patch %    Lines
src/cluster_legacy.c        66.66%     1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #1191      +/-   ##
============================================
+ Coverage     70.65%   70.70%   +0.04%     
============================================
  Files           114      114              
  Lines         61799    63076    +1277     
============================================
+ Hits          43664    44596     +932     
- Misses        18135    18480     +345     
Files with missing lines    Coverage Δ
src/cluster_legacy.c        86.59% <66.66%> (+0.27%) ⬆️

... and 92 files with indirect coverage changes

Contributor

@zuiderkwast left a comment


It seems to be correct, but I want someone else to take a look. @PingXie?

So when a node is in NOADDR state, it will never be PFAIL and later FAIL? We never get any updates from the old node ID, so shouldn't it automatically become PFAIL at some point?

Are there any other cases where a node can be marked as NOADDR and come back again? Changed IP address of the server but still running?

src/cluster_legacy.c: outdated review comment (resolved)
@enjoy-binbin
Member Author

So when a node is in NOADDR state, it will never be PFAIL and later FAIL? We never get any updates from the old node ID, so shouldn't it automatically become PFAIL at some point?

Yes, it will never be PFAIL or FAIL. It also won't automatically become PFAIL, since in clusterCron we skip NOADDR nodes for the timeout check.

Are there any other cases where a node can be marked as NOADDR and come back again? Changed IP address of the server but still running?

Maybe, but I am not aware of one. If the IP changes, we call nodeUpdateAddressIfNeeded to update it, so the node won't end up in the NOADDR state.
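
For reference, the skip that prevents this sits in the per-node loop of clusterCron; the snippet below is a paraphrase of that loop (variable declarations added, surrounding logic abbreviated), not the exact source.

```c
/* Paraphrased from clusterCron() in cluster_legacy.c: nodes flagged
 * MYSELF, NOADDR or HANDSHAKE are skipped before any reconnect, ping or
 * timeout handling, so a NOADDR node never reaches the check that would
 * set CLUSTER_NODE_PFAIL. */
dictIterator *di = dictGetSafeIterator(server.cluster->nodes);
dictEntry *de;
while ((de = dictNext(di)) != NULL) {
    clusterNode *node = dictGetVal(de);

    if (node->flags & (CLUSTER_NODE_MYSELF | CLUSTER_NODE_NOADDR |
                       CLUSTER_NODE_HANDSHAKE))
        continue;

    /* ... reconnect handling, ping scheduling, and the
     * "delay > server.cluster_node_timeout" check that flags
     * the node as PFAIL all happen after this point ... */
}
dictReleaseIterator(di);
```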

@enjoy-binbin added the run-extra-tests label (Run extra tests on this PR; runs all tests from daily except valgrind and RESP) on Oct 19, 2024
@zuiderkwast
Contributor

Just an idea: can we set it to PFAIL? It is unreachable from myself's point of view, but maybe another node can reach it somehow? Marking a node as FAIL is usually a majority decision. We could wait for a majority of nodes to mark it as FAIL, but that takes more time. Is that a problem?

@enjoy-binbin
Member Author

We won't include the NOADDR node in the gossip section. That is the problem: we will never get the majority.

@zuiderkwast
Contributor

We won't include the NOADDR node in the gossip section. That is the problem: we will never get the majority.

I guess another option is to start including it in the gossip section then.
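
For context, the filter that keeps NOADDR nodes out of gossip lives in clusterSendPing; the snippet below paraphrases that selection logic rather than quoting it exactly.

```c
/* Paraphrased from the gossip selection loop in clusterSendPing() in
 * cluster_legacy.c: HANDSHAKE nodes, NOADDR nodes, and disconnected
 * nodes that serve no slots are never placed in the gossip section, so
 * other nodes cannot collect enough failure reports to promote a NOADDR
 * node from PFAIL to FAIL. */
if (this->flags & (CLUSTER_NODE_HANDSHAKE | CLUSTER_NODE_NOADDR) ||
    (this->link == NULL && this->numslots == 0)) {
    freshnodes--; /* Count it against the gossip budget anyway. */
    continue;
}
```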
