Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Fetch Migration] Enable migration for identical, empty target cluster index #390

Merged
merged 7 commits into from
Nov 11, 2023

Conversation

kartg
Copy link
Member

@kartg kartg commented Nov 6, 2023

Description

  • Category: Enhancement

Why these changes are required?

This change enables Fetch Migration to migrate data for indices that are identical between the source and target cluster, and the document count on the target cluster is zero. Such behavior allow users to execute Fetch Migration in an idempotent manner without having to manually remove empty indices on the target cluster (that were created by a previous run of Fetch Migration that did not complete successfully)

What is the old behavior before changes and new behavior after changes?

Prior to this change, all indices that were identical between the source and target clusters were ineligible for data migration. Now, we check the doc_count for identical indices on the target cluster and only exclude identical indices with a non-zero document count.

Testing

$ python -m coverage report --omit "*/tests/*"
Name                           Stmts   Miss  Cover
--------------------------------------------------
endpoint_info.py                   6      0   100%
fetch_orchestrator.py             25      0   100%
index_diff.py                     20      0   100%
index_doc_count.py                 5      0   100%
index_operations.py               49      0   100%
metadata_migration.py            118      0   100%
metadata_migration_params.py       7      0   100%
metadata_migration_result.py       5      0   100%
migration_monitor.py              94      4    96%
migration_monitor_params.py        6      0   100%
progress_metrics.py               89      0   100%
utils.py                          13      0   100%
--------------------------------------------------
TOTAL                            437      4    99%

Also tested end-to-end via the CDK. Log output from a test run:

INFO:root:Running pre-migration steps...

Identical indices in the target cluster: [logs-241998, sonested, logs-181998, logs-221998, reindexed-logs, nyc_taxis, geonames, logs-191998, logs-201998, logs-231998, logs-211998]

Identical empty indices in the target cluster (data will be migrated): [sonested, reindexed-logs, nyc_taxis, geonames]

Indices present in both clusters with conflicting settings/mappings (data will not be migrated): []

Indices to be created in the target cluster (data will be migrated): []

Total number of documents to be moved: 3000

...

Check List

  • New functionality includes testing
    • All tests pass, including unit test, integration test and doctest
  • New functionality has been documented
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Copy link

codecov bot commented Nov 6, 2023

Codecov Report

Merging #390 (d327ab5) into main (4497873) will increase coverage by 0.09%.
The diff coverage is n/a.

@@             Coverage Diff              @@
##               main     #390      +/-   ##
============================================
+ Coverage     63.55%   63.64%   +0.09%     
- Complexity      715      718       +3     
============================================
  Files            82       82              
  Lines          3298     3298              
  Branches        303      303              
============================================
+ Hits           2096     2099       +3     
+ Misses         1014     1011       -3     
  Partials        188      188              
Flag Coverage Δ
unittests 63.64% <ø> (+0.09%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

see 2 files with indirect coverage changes

@kartg kartg force-pushed the fetch-doc-count-improvements branch 3 times, most recently from d968a25 to 29b6d8f Compare November 8, 2023 18:18
FetchMigration/python/metadata_migration.py Outdated Show resolved Hide resolved
FetchMigration/python/index_diff.py Outdated Show resolved Hide resolved
resp = requests.get(actual_endpoint, auth=endpoint.auth, verify=endpoint.verify_ssl)
def doc_count(indices: set, endpoint: EndpointInfo) -> IndexDocCount:
actual_endpoint = endpoint.url + ','.join(indices) + __SEARCH_COUNT_PATH
resp = requests.get(actual_endpoint, auth=endpoint.auth, verify=endpoint.verify_ssl, json=__SEARCH_COUNT_PAYLOAD)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be wrapped in an exception block printing errors or warnings when there is a connection problem. Additionally, status codes should be checked.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've got an open issue tracking such fixes across the implementation

sink:
- opensearch:
bulk_size: 10
# DO NOT CHANGE the hosts value - this will be updated by the CDK
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as comment above: I would add a note that they would need to change these only if running locally (in Docker)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After the most recent updates to the Fetch Migration CDK, the hosts value in the sink will be replaced by an inline target host value through the CDK, so changing this value will not affect that logic nor reflect in the final setup. I've updated the comment here to:

# Note - CDK code will replace this value with the target cluster endpoint

This is in contrast to the hosts value in source where the string value matters because the CDK performs a string substitution to inject the source cluster endpoint.

@kartg kartg force-pushed the fetch-doc-count-improvements branch from c9fc513 to fd9edd2 Compare November 10, 2023 19:40
This is encapsulated in an IndexDocCount class, which also track the total doc count

Signed-off-by: Kartik Ganesh <[email protected]>
…cluster indices

This change includes refactoring the index difference logic to an IndexDiff class, and an additional network call to the target cluster to fetch document counts for identical indices to determine which ones are empty.

Signed-off-by: Kartik Ganesh <[email protected]>
This change adds documentation to the dp_pipeline_template.yaml file and changes some reasonable defaults. It also bumps the Fetch Migration task definition to 1 CPu and 4 GB memory.

Signed-off-by: Kartik Ganesh <[email protected]>
Signed-off-by: Kartik Ganesh <[email protected]>
@kartg kartg force-pushed the fetch-doc-count-improvements branch from fd9edd2 to 54bead7 Compare November 10, 2023 20:26
@kartg kartg merged commit fb0553a into opensearch-project:main Nov 11, 2023
8 checks passed
@kartg kartg deleted the fetch-doc-count-improvements branch November 11, 2023 01:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants