
[BUG] Auto promotion is not triggered when the master leader experiences network failure/degradation #16848

Open
amberzsy opened this issue Dec 15, 2024 · 3 comments
Labels
bug Something isn't working Cluster Manager

Comments

@amberzsy
amberzsy commented Dec 15, 2024

Describe the bug

The OpenSearch cluster has 3 master nodes and 50+ data nodes. During a network failure / severe network degradation on the master leader node, a bunch of data nodes failed the leader check and became "disconnected" from the master leader. On the master side, those data nodes were excluded/removed from the cluster due to follower-check failures and failures in the cluster state publish process. (Note: at this point the master leader was still processing, publishing logs, updating cluster state, etc.)
This further led to massive shard relocation, or a red cluster state in some extreme cases (60% of data nodes marked as disconnected and removed by the master).
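For reference, the leader/follower checks described above are controlled by the fault-detection node settings. A sketch of the relevant opensearch.yml entries (setting names are from the OpenSearch docs; the values shown are my understanding of the defaults, so please verify against your version):

```yaml
# opensearch.yml (static node settings, not dynamically updatable)
# How the elected leader checks its followers:
cluster.fault_detection.follower_check.interval: 1s     # time between checks
cluster.fault_detection.follower_check.timeout: 10s     # per-check timeout
cluster.fault_detection.follower_check.retry_count: 3   # consecutive failures before the node is removed
# How followers check the elected leader:
cluster.fault_detection.leader_check.interval: 1s
cluster.fault_detection.leader_check.timeout: 10s
cluster.fault_detection.leader_check.retry_count: 3
```

With these defaults, a data node is removed roughly 3 consecutive failed checks after the degradation starts, which matches the rapid node removals observed here.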

Related component

Cluster Manager

To Reproduce

  1. Set up a cluster with 3 master nodes (1 leader and 2 standby) and a couple of data nodes.
  2. Trigger network degradation only on the master leader node (or trigger network-layer packet drops, etc.) for more than 5 minutes.
  3. Check the master leader and data node logs for follower/leader check failures and for data nodes starting to get removed by the master leader.
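One way to simulate step 2 on Linux (an assumption on my side, not necessarily how the original report was produced; the interface name `eth0`, the loss percentage, and the duration are all illustrative) is `tc` with the netem qdisc on the leader host:

```shell
#!/bin/sh
# Run on the master leader host; requires root and the sch_netem kernel module.
# Inject 30% packet loss plus 500ms latency on eth0 for ~6 minutes, then clean up.
tc qdisc add dev eth0 root netem loss 30% delay 500ms
sleep 360
tc qdisc del dev eth0 root netem
```

Injecting loss/latency (rather than a full partition with iptables) keeps the leader partially reachable, which appears to be the condition under which auto promotion fails to trigger.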

Expected behavior

Ideally, during network degradation/failures on the master leader, one of the two standby nodes would be automatically promoted/elected leader. However, this did not happen.

We tried the other scenarios mentioned below, and auto promotion works properly:

  1. Trigger a graceful shutdown on the master leader. A standby master-eligible node is promoted.
  2. Trigger an ungraceful shutdown of the leader (e.g. kill -9 the master leader process while it's running). A standby master-eligible node is promoted and keeps the cluster running. Data nodes update the leader info and re-communicate with the newly elected leader.

Additional Details

Plugins
opensearch-alerting
opensearch-anomaly-detection
opensearch-asynchronous-search
opensearch-cross-cluster-replication
opensearch-custom-codecs
opensearch-flow-framework
opensearch-geospatial
opensearch-index-management
opensearch-job-scheduler
opensearch-knn
opensearch-ml
opensearch-neural-search
opensearch-notifications
opensearch-notifications-core
opensearch-observability
opensearch-performance-analyzer
opensearch-reports-scheduler
opensearch-security
opensearch-skills
opensearch-sql
prometheus-exporter
repository-gcs
repository-s3

Screenshots
N/A

Host/Environment (please complete the following information):

  • OS: Linux
  • Version: 2.16.1

Additional context
N/A

@shwetathareja
Member

shwetathareja commented Dec 17, 2024

@amberzsy can you share the logs from all 3 cluster managers (aka masters)? Also, did you check whether the standby masters were also network partitioned and unable to ping each other?

@rajiv-kv
Contributor

[Triage Attendees - 1, 2, 3]

Thanks @amberzsy for filing the issue. Could you provide us with the following details:

  • Logs from the stand-by nodes, to see why the n/w disruption was not detected
  • Timeline of events (did the cluster eventually recover?)
  • Steps to reproduce (could you explain how the packet loss / network isolation was achieved?)

Please take a look at the existing disruption tests and see if there is a relevant one that can be referenced for this issue.

@andrross
Member

Unfortunately I haven't found any good documentation, but the setting index.unassigned.node_left.delayed_timeout can be tuned to increase the time before the system will reallocate replicas when a node is lost for any reason. This avoids starting expensive reallocations if the node is likely to return (i.e. in the case of an unstable network).
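A minimal sketch of applying that setting dynamically (the `_all` index target, the `10m` value, and the `localhost:9200` endpoint are illustrative; adjust to your cluster and security setup):

```shell
#!/bin/sh
# Delay replica reallocation for 10 minutes after a node leaves the cluster,
# so shards are not rebuilt if the node returns quickly (e.g. a network blip).
curl -X PUT "http://localhost:9200/_all/_settings" \
  -H 'Content-Type: application/json' \
  -d '{
    "settings": {
      "index.unassigned.node_left.delayed_timeout": "10m"
    }
  }'
```

The default is 1 minute, so on an unstable network a larger value trades slower recovery from genuine node loss for far fewer unnecessary relocations.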
