Compatibility with segment replication #974

dreamer-89 · 2023-06-29T02:09:03Z

Summary

With the 2.9.0 release, a lot of enhancements are going in for the segment replication [1][2] feature (which went GA in 2.7.0), and we need to ensure that the different plugins are compatible with the current state of this feature. Previously, we ran tests on plugin repos to verify this compatibility, but we want plugin owners to be aware of these changes so that any required updates can be made. With the 2.10.0 release, the remote store feature is going GA. It internally uses only the SEGMENT replication strategy, i.e. it enforces that all indices use SEGMENT replication. So it is important to validate that plugins are compatible with the segment replication feature.

What changed

1. Refresh policy behavior

RefreshPolicy.IMMEDIATE immediately refreshes only primary shards, not replica shards. Instead, after the refresh, the primary starts a round of segment replication to update the replica shard copies, which leads to eventual consistency.
RefreshPolicy.WAIT_UNTIL ensures that the indexing operation is searchable in your cluster, i.e. it gives a RAW (read-after-write) guarantee. With segment replication, this guarantee no longer holds because replica shard updates are delayed by asynchronous background refreshes.
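
For plugin code that talks to the cluster over REST rather than through the Java client, the refresh policies above correspond to values of the `refresh` query parameter on write requests. A minimal sketch of that mapping follows. The helper name, index name, and document ID are hypothetical, and the sketch only builds the request path, it does not contact a cluster.

```python
from urllib.parse import urlencode

# REST values of the `refresh` query parameter for each RefreshPolicy constant.
REFRESH_PARAM = {
    "NONE": "false",
    "IMMEDIATE": "true",       # refreshes primaries right away; replicas catch up later under segrep
    "WAIT_UNTIL": "wait_for",  # waits for a refresh; not a read-after-write guarantee on replicas under segrep
}


def index_doc_path(index, doc_id, refresh_policy):
    """Build the path for an index-document request (hypothetical helper)."""
    query = urlencode({"refresh": REFRESH_PARAM[refresh_policy]})
    return f"/{index}/_doc/{doc_id}?{query}"


if __name__ == "__main__":
    # "alerts-demo" is an illustrative index name, not one used by the plugin.
    print(index_doc_path("alerts-demo", "1", "IMMEDIATE"))
    print(index_doc_path("alerts-demo", "1", "WAIT_UNTIL"))
```

Either value only controls when the write becomes searchable on the primary. A search that lands on a replica can still miss the document until replication completes.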

2. Refresh lag on replicas

With segment replication, there is an inherent delay before documents become searchable on replica shard copies. This is because replica shards copy data (segment) files over from the primary. Thus, compared to document replication, replica shards will on average take longer to become consistent with primaries.
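
To make the lag concrete, here is a toy model (not how core implements replication) of when a refreshed write becomes visible on each shard copy. It assumes a single fixed replication lag in seconds, which is a simplification; all names are hypothetical.

```python
def is_visible(copy, refreshed_at_s, read_at_s, replica_lag_s):
    """Toy model: is a write that was refreshed on the primary at
    `refreshed_at_s` visible to a read at `read_at_s` on the given copy?

    Assumes a constant replication lag, which real clusters do not have.
    """
    if copy == "primary":
        return read_at_s >= refreshed_at_s
    if copy == "replica":
        return read_at_s >= refreshed_at_s + replica_lag_s
    raise ValueError(f"unknown shard copy: {copy}")


if __name__ == "__main__":
    lag = 3.0  # assumed lag of a few seconds
    # A read 1 second after the refresh sees the doc on the primary only.
    print(is_visible("primary", 0.0, 1.0, lag), is_visible("replica", 0.0, 1.0, lag))
    # A read 60 seconds later sees it on both copies.
    print(is_visible("primary", 0.0, 60.0, lag), is_visible("replica", 0.0, 60.0, lag))
```

Under this model, a write followed immediately by a search can miss the document on a replica, while a read one monitor interval later does not.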

3. System/hidden indices support

With opensearch-project/OpenSearch#8200, system and hidden indices are now supported with the SEGMENT replication strategy. We need to ensure there are no bottlenecks that prevent system/hidden indices from using segment replication.

Next steps

With segment replication, strong reads are not guaranteed. Thus, if the plugin needs strong read guarantees, especially as an alternative to the changed refresh policy behavior and the refresh lag on replicas (points 1 and 2 above), we need to update search requests to target primary shards only. With opensearch-project/OpenSearch#7375, core now supports searching primary shards only. Please follow the documentation for examples and details.
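
As a rough illustration of a primary-only search, the sketch below builds a search request that sets the `preference` parameter to `_primary`. The helper and index name are hypothetical, and the request is only constructed, not sent.

```python
import json
from urllib.parse import urlencode


def primary_only_search(index, query):
    """Return (method, path, body) for a search routed to primary shards.

    Hypothetical helper: it only assembles the request pieces.
    """
    path = f"/{index}/_search?{urlencode({'preference': '_primary'})}"
    body = json.dumps({"query": query})
    return "GET", path, body


if __name__ == "__main__":
    # "my-index" is an illustrative index name.
    method, path, body = primary_only_search("my-index", {"match_all": {}})
    print(method, path)
    print(body)
```

Routing every search to primaries concentrates read load on them, so this is probably best limited to reads that really need read-after-write behavior.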

Open questions

In case of any questions or issues, please post them in the core issue.

Reference

[1] Design

[2] Documentation


dreamer-89 · 2023-06-29T19:51:18Z

Request owners to add v2.9.0 label on this issue.

eirsep · 2023-07-05T18:14:53Z

IMO the following changes will be needed for the alerting plugin to be compatible with segrep:

we will have to change usages of RefreshPolicy.IMMEDIATE everywhere to RefreshPolicy.WAIT_UNTIL while storing alerts, findings, monitors, workflows as we need RAW guarantee.
we need to verify impact of "replica refresh lag" on Alerting service SLAs.

Tagging @lezzago to review

dreamer-89 · 2023-07-10T23:33:24Z

Hi Plugin Owners,
Gentle reminder to look into this issue, as the code freeze date for the 2.9.0 release (July 11th) is near.

getsaurabh02 · 2023-07-11T22:02:33Z

Discussed this further with @dreamer-89. Currently the system/hidden indices used by the plugin to maintain state (Alerting Config, Findings, Alerts, etc.) will continue to work as is in release 2.9, as we are not onboarding them to the SEGMENT replication strategy. Before 2.10, once it is enforced, we will have to:

Ensure that the plugin's internal system/hidden indices are compatible with the segment replication feature. Ideally, the delay between updates and reads on these indices is equivalent to the time between two execution cycles. Since the refresh lag (point 2) is expected to be on the order of a few seconds, while alerting monitors are usually scheduled at a frequency of one minute or longer, we do not expect race conditions in theory.
The data indices, if seg rep is enabled, might again interfere with read consistency because of the refresh lag between primaries and replicas (point 2). Always reading from primaries would be an inefficient solution from the Alerting standpoint. However, we can again rely on the fact that the delay between updates and subsequent reads will be longer than the refresh delay, since alerting monitors run at most once per minute.
With 2.10 we can add a few integration tests on indices with the seg rep settings explicitly enabled to build a larger degree of confidence.
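
For the integration tests mentioned in the last point, one possible setup is to create the test index with segment replication enabled through the `index.replication.type` setting. The sketch below only builds the create-index request. The helper name, index name, and replica count are assumptions for illustration.

```python
import json


def segrep_index_request(index, replicas=1):
    """Return (method, path, body) for creating an index that uses
    SEGMENT replication (hypothetical test helper)."""
    settings = {
        "settings": {
            "index": {
                "replication.type": "SEGMENT",
                # At least one replica is needed to exercise replica lag.
                "number_of_replicas": replicas,
            }
        }
    }
    return "PUT", f"/{index}", json.dumps(settings)


if __name__ == "__main__":
    # "segrep-test-index" is an illustrative name.
    method, path, body = segrep_index_request("segrep-test-index")
    print(method, path)
    print(body)
```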

lezzago · 2023-08-22T15:17:10Z

we will have to change usages of RefreshPolicy.IMMEDIATE everywhere to RefreshPolicy.WAIT_UNTIL while storing alerts, findings, monitors, workflows as we need RAW guarantee.

I don't think so, since it is stated that RefreshPolicy.WAIT_UNTIL will no longer have RAW guarantees.

I believe the only data that we write and immediately try to read are docs in the doc_level_queries index for document level alerting. We would need to modify the code that reads that index to fetch only from primary shards to ensure the guarantee.

dreamer-89 · 2023-08-22T19:52:54Z

1. Refresh policy behavior

RefreshPolicy.IMMEDIATE will only refresh primary shards but not replica shards immediately. Instead post refresh, primary will start a round of segment replication to update the replica shard copies leading to eventual consistency.

RefreshPolicy.WAIT_UNTIL ensures the indexing operation is searchable in your cluster i.e. RAW (Read after write guarantee). With segment replication, this guarantee is not promised due to delay in replica shard updates from asynchronous background refreshes.

Thanks @lezzago for the update. Yes, using the _primary routing preference is one option to get RAW guarantees.
The other option is to use the get/mget APIs, which by default provide real-time reads. Core recently added support for this on segment replication enabled indices with opensearch-project/OpenSearch#8536. Please have a look for more details.
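
A sketch of the second option: reading documents back by ID through get/mget rather than through search. The `realtime` parameter defaults to true, so it is spelled out here only for clarity. The helpers, index name, and IDs are hypothetical, and the requests are only constructed.

```python
import json


def get_doc_request(index, doc_id):
    """Return (method, path) for a real-time get by ID (hypothetical helper)."""
    return "GET", f"/{index}/_doc/{doc_id}?realtime=true"


def mget_request(index, doc_ids):
    """Return (method, path, body) for a real-time multi-get (hypothetical helper)."""
    body = json.dumps({"docs": [{"_index": index, "_id": i} for i in doc_ids]})
    return "POST", "/_mget?realtime=true", body


if __name__ == "__main__":
    # "queries-demo" is an illustrative index name.
    print(get_doc_request("queries-demo", "q1"))
    print(mget_request("queries-demo", ["q1", "q2"]))
```

This only helps when the caller already knows the document IDs. Reads that need a query still go through search.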

lezzago · 2023-08-28T22:54:57Z

@dreamer-89, I have noticed one potential issue for customers by enabling segment replication. If the customer is using percolate queries, they need to index their queries before running the percolate query. By design, that will require a strong read on the recently indexed queries. Has core made changes to ensure that it could handle this use case and just query on the primary shards of the indexed queries?

lezzago · 2023-08-30T23:35:06Z

Closing as once core makes the necessary changes to handle seg rep, the alerting plugin will be fine.

lezzago · 2023-09-01T18:29:29Z

Reopening until opensearch-project/OpenSearch#9669 is resolved.

dreamer-89 · 2023-09-01T19:25:41Z

Reopening until opensearch-project/OpenSearch#9669 is resolved.

@lezzago: Can you please share more details on the core issue opensearch-project/OpenSearch#9669 about why percolate queries may not work as intended with the segment replication feature?

lezzago · 2023-09-01T22:06:58Z

In the plugin, we use percolate queries for our Document Level Monitor. Whenever the monitor is run, we update an Alerting index with the updated queries and schema mappings and have a refresh immediate policy set for it. Then we run a percolate query search with the Alerting index as the query index store and query the data that the monitor needs to search.

With the seg rep changes, the percolate query code inside OpenSearch core needs to ensure that the query index store it is searching is up to date by searching on the primary shards. If this doesn't happen, the Document Level Monitor could fail to fetch all the data and miss generating alerts when it should have. That could have a big impact on the customer, and it is especially bad because it would be a silent error that they would not know about.

Additionally, if Document Level monitors cannot ensure they are fetching all the data, it would have big repercussions for the Security-Analytics plugin, which relies heavily on Document Level monitors for its detectors and could miss security issues for the users of that plugin.
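
To illustrate the flow described above, the sketch below builds a percolate search against the index that stores the queries and routes it to primary shards with `preference=_primary`. This is an assumption about one possible plugin-side mitigation, not a description of what core or the plugin actually does. The helper, index name, and the percolator field name `query` are hypothetical.

```python
import json
from urllib.parse import urlencode


def percolate_on_primaries(query_index, document, field="query"):
    """Return (method, path, body) for a percolate search that targets
    primary shards of the query index (hypothetical helper)."""
    path = f"/{query_index}/_search?{urlencode({'preference': '_primary'})}"
    body = json.dumps(
        {"query": {"percolate": {"field": field, "document": document}}}
    )
    return "GET", path, body


if __name__ == "__main__":
    # "doc-level-queries-demo" and the document are illustrative only.
    method, path, body = percolate_on_primaries(
        "doc-level-queries-demo", {"message": "login failed"}
    )
    print(method, path)
    print(body)
```

Because the queries are indexed just before the percolate search runs, a search served by a lagging replica could match against an outdated set of queries. Targeting primaries is one way to avoid that.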

dreamer-89 added the enhancement and untriaged labels Jun 29, 2023

dreamer-89 mentioned this issue Jun 29, 2023

[Meta] Validate plugins compatibility with segment replication opensearch-project/OpenSearch#8211

Closed

37 tasks

gaiksaya added the v2.9.0 label Jul 3, 2023

eirsep removed the untriaged label Jul 3, 2023

getsaurabh02 assigned lezzago Aug 25, 2023

lezzago closed this as completed Aug 30, 2023

dreamer-89 mentioned this issue Aug 31, 2023

[BUG] [Segment Replication] Handle percolate queries opensearch-project/OpenSearch#9669

Closed

kaituo mentioned this issue Aug 31, 2023

Compatibility with segment replication opensearch-project/anomaly-detection#989

Closed

lezzago reopened this Sep 1, 2023

github-actions bot added the untriaged label Sep 1, 2023

lezzago removed the untriaged label Sep 1, 2023

eirsep mentioned this issue Sep 5, 2023

Changes in workflows for seg rep compatibility #1114

Merged

dreamer-89 mentioned this issue Sep 5, 2023

[DOC] Alerting plugin not compatible with segment replication feature opensearch-project/documentation-website#4967

Closed

4 tasks

lezzago mentioned this issue Sep 27, 2023

Add primary first calls for different monitor types #1205

Merged

1 task
