[Fetch Migration] Handle idle pipeline when target doc count is never reached #377
Conversation
Codecov Report
```
@@           Coverage Diff            @@
##               main     #377 +/-  ##
=========================================
  Coverage     63.55%   63.55%
  Complexity      715      715
=========================================
  Files            82       82
  Lines          3298     3298
  Branches        303      303
=========================================
  Hits           2096     2096
  Misses         1014     1014
  Partials        188      188
```

Flags with carried forward coverage won't be shown.
Force-pushed from 5beb8a4 to 686bb94
```python
idle_threshold: int
current_values_map: dict[str, Optional[int]]
prev_values_map: dict[str, Optional[int]]
counter_map: dict[str, int]
```
Minor, but I'm not sure I get why we're using a single dict with key prefixes to track idle values and failures--does it make more sense to keep them separately?
This also isn't a blocker for me, but going forward, I think it's a little cleaner to use something like named tuples or dataclasses for these dictionaries. There's (I believe) a fixed set of possible values for each of these, so it seems clearer to a reader and less error-prone to use defined lists of possible values.
> Minor, but I'm not sure I get why we're using a single dict with key prefixes to track idle values and failures--does it make more sense to keep them separately?
Yes, these could be stored in separate structures. In my mind, they're both counter values - one for the number of times a metric value is idle, and the other for the number of failures - but I can see how this conflation makes the code harder to read.
I'll look into incorporating this, as well as your feedback on stronger control of possible values, in a follow-up PR.
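To illustrate the reviewer's suggestion, here is a minimal sketch of what separate, strongly-typed counter structures might look like. This is a hypothetical design (the class and field names are not from the PR), using dataclasses so a typo in a counter name becomes an `AttributeError` instead of silently creating a new dict key:

```python
from dataclasses import dataclass, field

@dataclass
class MetricCounters:
    """Counters tracked for a single metric (hypothetical names)."""
    idle_count: int = 0     # consecutive polls where the value was unchanged
    failure_count: int = 0  # polls where the value could not be fetched

    def record_idle(self) -> None:
        self.idle_count += 1

    def record_failure(self) -> None:
        self.failure_count += 1

    def reset(self) -> None:
        self.idle_count = 0
        self.failure_count = 0

@dataclass
class ProgressCounters:
    """One counter group per tracked metric, replacing prefixed dict keys."""
    doc_success: MetricCounters = field(default_factory=MetricCounters)
    records_in_flight: MetricCounters = field(default_factory=MetricCounters)
    no_partitions: MetricCounters = field(default_factory=MetricCounters)
```

With this shape, the fixed set of tracked metrics is visible in one place, and each metric's idle and failure counters stay together without string-prefix conventions.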
```python
__DOC_SUCCESS_METRIC: str = "_opensearch_documentsSuccess"
__RECORDS_IN_FLIGHT_METRIC: str = "_BlockingBuffer_recordsInFlight"
__NO_PARTITIONS_METRIC: str = "_noPartitionsAcquired"
__IDLE_THRESHOLD: int = 5
```
I see how this is used in ProgressMetrics, but I don't have much understanding on where this value comes from and why 5 is the right number. Can you elaborate?
`5` was a completely random choice 😄 Data Prepper does not have the notion of an "idle" pipeline, so this threshold is completely up to us. With a default polling interval of 30 seconds, I didn't want the monitor to leave the pipeline running for too long, or close it too quickly. So I picked 5 iterations (i.e. 2.5 minutes) as a reasonable (IMO) threshold.
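The arithmetic behind that choice can be sketched as follows. This is an illustrative snippet, not the PR's code; the constant and function names are assumptions:

```python
# Assumed defaults, per the discussion above
POLL_INTERVAL_SECONDS = 30  # default polling interval
IDLE_THRESHOLD = 5          # consecutive flat polls before declaring idle

def is_idle(unchanged_polls: int) -> bool:
    """The pipeline is considered idle once a tracked metric has been
    unchanged for IDLE_THRESHOLD consecutive polls."""
    return unchanged_polls >= IDLE_THRESHOLD

# Worst-case wall-clock time before the monitor shuts down an idle pipeline:
seconds_to_trip = POLL_INTERVAL_SECONDS * IDLE_THRESHOLD  # 150s = 2.5 minutes
```

A lower threshold risks killing a pipeline during a transient stall; a higher one leaves a finished pipeline running longer than necessary.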
```python
# Reset API failure counter
progress.reset_metric_api_failure()
success_docs = get_metric_value(metrics, __DOC_SUCCESS_METRIC)
rec_in_flight = get_metric_value(metrics, __RECORDS_IN_FLIGHT_METRIC)
no_partitions_count = get_metric_value(metrics, __NO_PARTITIONS_METRIC)
if success_docs is not None:  # pragma no cover
    completion_percentage: int = math.floor((success_docs * 100) / target_doc_count)
progress.update_records_in_flight_count(get_metric_value(metrics, __RECORDS_IN_FLIGHT_METRIC))
progress.update_no_partitions_count(get_metric_value(metrics, __NO_PARTITIONS_METRIC))
if success_docs is not None:
    completion_percentage = progress.update_success_doc_count(success_docs)
    progress_message: str = "Completed " + str(success_docs) + \
        " docs ( " + str(completion_percentage) + "% )"
    logging.info(progress_message)
    if progress.all_docs_migrated():
        logging.info("All documents migrated...")
else:
    progress.record_success_doc_value_failure()
    logging.warning("Could not fetch progress stats from Data Prepper response, " +
                    "will retry on next polling cycle...")
```
I think there's still more logic here than I would expect, given `ProgressMetrics`. Does it make sense for `ProgressMetrics` to have an `update` (or whatever) function that accepts a `metrics` object, so this function can be stripped down to basically:

```python
metrics = fetch_prometheus_metrics(endpoint_info)
progress.update(metrics)
if progress.all_docs_migrated():
    # do whatever
else:
    # do other stuff.
```
I think that's a reasonable ask, but I'd like to defer that change to the point where we have a better mechanism to surface metrics. I intentionally kept all of the logging out of `ProgressMetrics` so it could function like a pure dataclass.
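For readers following along, the suggested refactor might look roughly like this. It is a hypothetical sketch (the metric key is taken from the constants quoted earlier in the thread; the class shape is assumed, not the PR's actual implementation), illustrating how `update()` could digest a raw metrics payload while all logging stays in the monitoring loop:

```python
from typing import Optional

class ProgressMetrics:
    """Pure data holder: ingests metrics, answers questions, never logs."""

    def __init__(self, target_doc_count: int):
        self.target_doc_count = target_doc_count
        self.success_docs: Optional[int] = None

    def update(self, metrics: dict) -> None:
        # Pull out only the values this class tracks; no side effects
        self.success_docs = metrics.get("_opensearch_documentsSuccess")

    def all_docs_migrated(self) -> bool:
        return self.success_docs is not None and \
            self.success_docs >= self.target_doc_count
```

The caller then shrinks to `progress.update(metrics)` followed by a branch on `progress.all_docs_migrated()`, with all `logging` calls living in the monitor rather than the data class.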
```python
else:
    # Thread sleep
    time.sleep(poll_interval_seconds)
if dp_process is None or is_process_alive(dp_process):
```
Why might `dp_process` be `None`? It seems surprising that would be handled in the same case as `is_process_alive == True`.
`dp_process` being `None` means the Data Prepper process is not a subprocess - this allows the migration monitor module to be used against remote Data Prepper processes as well. While the Fetch Migration solution today only uses a local subprocess, this flexibility allows the migration monitor to be used in other scenarios.

Here, `dp_process is None` functions as short-circuit logic for the `is_process_alive` check - which shouldn't be run unless the Data Prepper process is local.

The comment preceding the `run` method was incorrect, so I updated this.
…ete migration

This includes a new ProgressMetrics class that is used by the migration monitor to track various Data Prepper and API failure metrics in order to detect an idle pipeline. Much of the migration-success logic from the monitoring module has now been encapsulated in this class. Unit test updates and improvements are also included.

Signed-off-by: Kartik Ganesh <[email protected]>

Run and monitor_local have been merged into a single function since most of their code/logic is identical. Unit tests have been updated for improved coverage.

Signed-off-by: Kartik Ganesh <[email protected]>
Force-pushed from 5eecac5 to 21b3717
Description

This commit introduces changes to track various Data Prepper metrics (as well as API failure counters) in order to detect an idle pipeline or non-responsive Data Prepper subprocess. A new `ProgressMetrics` class has been added to track these metrics and encapsulate detection logic. Much of the migration-success logic from the monitoring module has now been moved to this class. Unit test updates and improvements are also included.

This PR also includes a second commit which refactors/merges the `run` and `monitor_local` functions together (since most of their code/logic is identical) for improved unit test coverage.

Testing
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.