[RW Separation] Change search replica recovery flow #15952

Open · Tracked by #15306
mch2 opened this issue Sep 16, 2024 · 3 comments · May be fixed by #16760
Labels: v2.19.0 Issues and PRs related to version 2.19.0

mch2 (Member) commented Sep 16, 2024

Search replicas are currently configured to recover with the peer recovery flow. We need to change this so that these shards have reduced communication with primary shards during recovery (node-node replication) and zero communication (remote-backed storage).

In both cases, I think we can initialize the shards with store recovery steps and then run a round of replication before marking the shards as active. I think we may also need to filter these allocation IDs out of cluster manager updates to the primary, since the primary should not track them as in-sync copies.
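
To make the intent concrete, here is a minimal sketch of the routing-level decision this implies. The enum and picker method are hypothetical stand-ins, not OpenSearch APIs; the variant names only loosely mirror the real RecoverySource types in org.opensearch.cluster.routing.RecoverySource.

// Hypothetical sketch: choosing a recovery source for an unassigned
// replica. None of these types exist in OpenSearch as written.
enum RecoverySourceType { PEER, EXISTING_STORE, EMPTY_STORE, REMOTE_STORE }

final class RecoverySourcePicker {
    // Regular replicas keep peer recovery; search replicas recover from
    // a store (local or remote) and never sync directly off the primary.
    static RecoverySourceType pick(boolean searchOnly,
                                   boolean remoteStoreEnabled,
                                   boolean hasExistingLocalCopy) {
        if (!searchOnly) {
            return RecoverySourceType.PEER;
        }
        if (remoteStoreEnabled) {
            return RecoverySourceType.REMOTE_STORE;
        }
        return hasExistingLocalCopy
            ? RecoverySourceType.EXISTING_STORE
            : RecoverySourceType.EMPTY_STORE;
    }
}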

mch2 changed the title from "Change search replica recovery flow." to "[RW Separation] Change search replica recovery flow" Sep 16, 2024
mch2 added the v2.18.0 (Issues and PRs related to version 2.18.0) label and removed the untriaged label Sep 16, 2024
mch2 self-assigned this Sep 16, 2024
mch2 moved this from Todo to In Progress in Performance Roadmap Sep 16, 2024
prudhvigodithi (Member) commented
Hey @mch2, based on my understanding I have summarized below how we can remove peer recovery for search replicas; please check.

Filtering Allocation IDs in Cluster Updates:

In OpenSearch, the cluster manager tracks which shards are in sync with the primary shard (we can check using /_cluster/state). This helps ensure data consistency across the cluster. The proposal suggests filtering out the allocation IDs of search replicas from these updates: since search replicas are read-only and do not participate in writes, the cluster manager doesn't need to track them as in-sync copies the way it does for regular replicas.

We can create a separate child issue to track this?
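
As a rough illustration of the filtering, a self-contained sketch with stand-in types (ShardCopy and computeInSyncIds are hypothetical; the real code would work with ShardRouting entries and the primary's replication tracker):

import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Stand-in for a shard routing entry with a search-only flag.
record ShardCopy(String allocationId, boolean searchOnly) {}

final class InSyncFilter {
    // Allocation IDs the primary should track as in-sync copies,
    // excluding search replicas, which never receive writes.
    static Set<String> computeInSyncIds(List<ShardCopy> copies) {
        return copies.stream()
            .filter(c -> !c.searchOnly())
            .map(ShardCopy::allocationId)
            .collect(Collectors.toSet());
    }

    public static void main(String[] args) {
        List<ShardCopy> copies = List.of(
            new ShardCopy("alloc-primary", false),
            new ShardCopy("alloc-write-replica", false),
            new ShardCopy("alloc-search-replica", true));
        // Prints only the primary and write-replica IDs.
        System.out.println(computeInSyncIds(copies));
    }
}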

Search Replica Recovery

Check Local Storage and Recover from Local Disk

The replica checks its own disk to see if it already has the necessary segment files from previous operations or previous shard instantiations. If the required data is on the search replica’s local disk, it uses that to recover the shard, skipping any communication with the primary node.

Recover from Remote Storage (if configured):

If the data is not available on the local disk, the replica can pull the data from remote-backed storage to recover the shard. We could even skip the local disk check entirely when remote storage is configured.

Replication Process (optional): can be triggered after recovery, or can wait for the next refresh interval

After initializing the replica via store recovery (local or remote), a replication round can run to ensure the shard is synchronized with the latest segments; both variants are sketched below.
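
Here is a sketch of the two startup variants, using hypothetical method names (recoverFromStore, runReplicationRound, markActive are stand-ins, not real OpenSearch APIs):

// Hypothetical lifecycle for bringing a search replica online.
interface SearchReplicaShard {
    void recoverFromStore();     // local-disk or remote-backed store recovery
    void runReplicationRound();  // one round of segment replication
    void markActive();           // report STARTED to the cluster manager
}

final class SearchReplicaStartup {
    // Option A: catch up with the latest segments before activating.
    static void startEagerly(SearchReplicaShard shard) {
        shard.recoverFromStore();
        shard.runReplicationRound();
        shard.markActive();
    }

    // Option B: activate immediately and let the next scheduled
    // refresh/replication interval bring the shard up to date.
    static void startLazily(SearchReplicaShard shard) {
        shard.recoverFromStore();
        shard.markActive();
    }
}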

Thank you @getsaurabh02

prudhvigodithi (Member) commented

So in this example API call, curl -X GET "http://localhost:9200/_cluster/state/routing_table?filter_path=routing_table.indices.my_sample_index" -H 'Content-Type: application/json' | jq '.', I can see "state": "STARTED" for the copy with "searchOnly": true. Is the proposal to remove this so that the cluster manager need not track the searchOnly replicas?

{
  "routing_table": {
    "indices": {
      "my_sample_index": {
        "shards": {
          "0": [
            {
              "state": "STARTED",
              "primary": false,
              "searchOnly": false,
              "node": "BfhYEGSERsOMVxXa1BTFFA",
              "relocating_node": null,
              "shard": 0,
              "index": "my_sample_index",
              "allocation_id": {
                "id": "6TMmzm-6S0ip36g5VsWSog"
              }
            },
            {
              "state": "STARTED",
              "primary": false,
              "searchOnly": true,
              "node": "yOfxLMoGQPeVJ1lcHNPMrw",
              "relocating_node": null,
              "shard": 0,
              "index": "my_sample_index",
              "allocation_id": {
                "id": "icMnWUYjTReUN0J2H6UUlA"
              }
            },
            {
              "state": "STARTED",
              "primary": true,
              "searchOnly": false,
              "node": "KCszv1QKSZ-HHqzuAcoYbg",
              "relocating_node": null,
              "shard": 0,
              "index": "my_sample_index",
              "allocation_id": {
                "id": "Y6CRS10hT26SbpkdaiwMsQ"
              }
            }
          ]
        }
      }
    }
  }
}

mch2 added the v2.19.0 (Issues and PRs related to version 2.19.0) label and removed the v2.18.0 (Issues and PRs related to version 2.18.0) label Oct 24, 2024
mch2 (Member, Author) commented Dec 2, 2024

is the proposal to remove this so that cluster manager need not track the searchOnly replicas?

No, the shards should still be included in the routing table, just not tracked as in-sync at the primary.

@prudhvigodithi please see the linked draft for what I'm thinking here. To simplify, I've reduced the scope of separation to only remote-store-enabled domains, at least for now.
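
A small sketch of that scoping decision; the setting key matches OpenSearch's index.remote_store.enabled, but the surrounding class is a stand-in:

import java.util.Map;

final class SearchReplicaGate {
    static final String REMOTE_STORE_ENABLED = "index.remote_store.enabled";

    // Only remote-store-backed indices take the new store-based recovery
    // path for now; everything else keeps the existing peer recovery flow.
    static boolean useStoreBasedRecovery(Map<String, String> indexSettings,
                                         boolean searchOnly) {
        return searchOnly
            && Boolean.parseBoolean(
                indexSettings.getOrDefault(REMOTE_STORE_ENABLED, "false"));
    }
}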
