Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Fetch Migration] Improvements to subprocess handling #372

Merged
merged 4 commits into from
Oct 31, 2023

Conversation

kartg
Copy link
Member

@kartg kartg commented Oct 26, 2023

Description

The migration_monitor.py module now includes a separate function (monitor_local) to monitor a Data Prepper subprocess.

Previously, the migration monitor only interacted with the Data Prepper process via API calls. This has the limitation that the monitor cannot differentiate between the subprocess intermittently not responding (or not yet started) versus a non-running subprocess. Thus, the API calls simply retry indefinitely if no response is received - which resulted in a indefinite polling loop if the subprocess had crashed or ended (say due to a bad configuration, OOM error, etc.)

The new function adds logic to poll the running subprocess and its return code (via Python's Popen object) - this is in addition to the monitoring via API calls. Unit tests for these use-cases have been added/updated.

This PR also includes an additional commit which adds a fix to index_operations.py to filter out system indices (so they are excluded from metadata migration). Short-circuit logic for empty source clusters has been added, along with related unit tests.

Testing

$ python -m coverage report --omit "*/tests/*"
Name                           Stmts   Miss  Cover
--------------------------------------------------
endpoint_info.py                   6      0   100%
fetch_orchestrator.py             25      0   100%
index_operations.py               37      0   100%
metadata_migration.py            125      1    99%
metadata_migration_params.py       7      0   100%
metadata_migration_result.py       5      0   100%
migration_monitor.py              96      3    97%
migration_monitor_params.py        5      0   100%
utils.py                          13      0   100%
--------------------------------------------------
TOTAL                            319      4    99%

Check List

  • New functionality includes testing
    • All tests pass, including unit test, integration test and doctest
  • New functionality has been documented
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@codecov
Copy link

codecov bot commented Oct 26, 2023

Codecov Report

Merging #372 (6af8de7) into main (4abe3ba) will decrease coverage by 0.03%.
Report is 6 commits behind head on main.
The diff coverage is 52.94%.

@@             Coverage Diff              @@
##               main     #372      +/-   ##
============================================
- Coverage     66.46%   66.44%   -0.03%     
+ Complexity      731      727       -4     
============================================
  Files            81       80       -1     
  Lines          3170     3165       -5     
  Branches        286      286              
============================================
- Hits           2107     2103       -4     
+ Misses          879      878       -1     
  Partials        184      184              
Flag Coverage Δ
unittests 66.44% <52.94%> (-0.03%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Coverage Δ
...org/opensearch/migrations/replay/Accumulation.java 82.85% <100.00%> (ø)
.../migrations/replay/datatypes/UniqueRequestKey.java 100.00% <100.00%> (ø)
...ch/migrations/replay/SourceTargetCaptureTuple.java 0.00% <0.00%> (ø)
.../opensearch/migrations/replay/TrafficReplayer.java 21.79% <0.00%> (+0.14%) ⬆️

Copy link
Collaborator

@gregschohn gregschohn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added comments to the code as best as I could follow it.
Some documentation in the main files about what you're doing, what the dependencies are doing, and what the responsibilities are for everything would help make a review from me more valuable. I don't know if are issues with the current patch because I can't fully assimilate the starting case yet.

FetchMigration/python/fetch_orchestrator.py Outdated Show resolved Hide resolved
FetchMigration/python/metadata_migration.py Show resolved Hide resolved
FetchMigration/python/migration_monitor.py Outdated Show resolved Hide resolved
FetchMigration/python/migration_monitor.py Outdated Show resolved Hide resolved
FetchMigration/python/migration_monitor.py Outdated Show resolved Hide resolved
FetchMigration/python/migration_monitor.py Show resolved Hide resolved
FetchMigration/python/migration_monitor.py Outdated Show resolved Hide resolved
FetchMigration/python/migration_monitor.py Show resolved Hide resolved

__DP_EXECUTABLE_SUFFIX = "/bin/data-prepper"
__PIPELINE_OUTPUT_FILE_SUFFIX = "/pipelines/pipeline.yaml"


def run(dp_base_path: str, dp_config_file: str, dp_endpoint: str):
def run(dp_base_path: str, dp_config_file: str, dp_endpoint: str) -> Optional[int]:
dp_exec_path = dp_base_path + __DP_EXECUTABLE_SUFFIX
output_file = dp_base_path + __PIPELINE_OUTPUT_FILE_SUFFIX
metadata_migration_params = MetadataMigrationParams(dp_config_file, output_file, report=True)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can your code confirm that DataPrepper is bound to only localhost? I could see a lot of difficulty happening if a second managing agent came along and started acting on it - or if somebody else DDOS'ed the API (while it was causing a lot of enery on the source cluster).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this code needs an update to stop accepting dp_endpoint. Currently the localhost configuration is driven only by the Dockerfile:

ENTRYPOINT python3 -u ./fetch_orchestrator.py $DATA_PREPPER_PATH $FM_CODE_PATH/input.yaml http://localhost:4900

This is obviously not a strong check. I'll make that change in a follow-up PR.

FetchMigration/python/migration_monitor_counters.py Outdated Show resolved Hide resolved
Also includes a minor optimization to short-circuit logic when there are no source indices found, along with a unit test.

Signed-off-by: Kartik Ganesh <[email protected]>
The migration_monitor module now includes a separate method to monitor a Data Prepper subprocess. In addition to interacting/monitoring via Data Prepper API calls (which is the existing implementation) the new, "local" monitoring method adds logic to poll the running subprocess and its return code. Unit tests for these use-cases have been added/updated.

Signed-off-by: Kartik Ganesh <[email protected]>
This enables quitting the Fetch Migration workflow with a return_code bubbled up from migration_monitor.py, which can help indicate a successful (return code zero) or non-successful execution.

Signed-off-by: Kartik Ganesh <[email protected]>
Signed-off-by: Kartik Ganesh <[email protected]>
Copy link
Collaborator

@sumobrian sumobrian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@kartg
Copy link
Member Author

kartg commented Oct 31, 2023

CodeCov failures are looking at Java code, which this PR doesn't touch. I'll go ahead and merge this.

@kartg kartg merged commit 135fedf into opensearch-project:main Oct 31, 2023
6 of 8 checks passed
@kartg kartg deleted the fetch-migration-updates branch October 31, 2023 18:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants