
[Index Configuration Tool] Unified Dockerfile for ICT and Data Prepper #234

Merged
2 commits merged into opensearch-project:main on Jul 19, 2023

Conversation

kartg
Member

@kartg kartg commented Jul 7, 2023

Description

This change updates the Dockerfile definition for index_configuration_tool (ICT) to use Data Prepper as the base image. This will allow us to run (as a single Docker image) the index_configuration_tool Python logic to configure indices on the target cluster before kicking off Data Prepper execution to move data over. Note that since these steps must occur in sequence, the child Docker image overrides the parent's CMD definition and must manually invoke the data-prepper executable.
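
The layering described above could look roughly like the following sketch. The base image tag, working directory, and paths are illustrative, not the exact ones used in the PR:

```dockerfile
# Sketch: build ICT on top of the Data Prepper image so both tools
# ship in a single Docker image (base tag is illustrative)
FROM opensearchproject/data-prepper:latest

# Add the ICT Python sources on top of the base image
COPY ./*.py .

# The child image's CMD overrides the parent's, so Data Prepper must be
# invoked explicitly after the index-configuration step completes
CMD python -u ./main.py -r <input-pipeline.yaml> <output-pipeline.yaml>; \
    /usr/share/data-prepper/bin/data-prepper
```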

This is a POC implementation, so there are some limitations:

  • The data migration pipeline does not shut down automatically once all data has been migrated. This is pending platform support in Data Prepper - see data-prepper#2944 ([Feature Request] Allow the OpenSearch source plugin to shut down the Data Prepper pipeline)
  • If a _bulk call fails during data migration, the index-creation steps performed by ICT are not rolled back. These must be manually removed before running the Docker image again. This is necessary because ICT errs on the side of caution - the current logic does not validate doc_count for identical indices so we may end up incorrectly overwriting documents on the target cluster. Even if the doc_count is identical, it would require a much deeper check to verify if the contents of the index are identical.
  • If there are no indices to migrate, the Docker container should ideally just shut down without running Data Prepper. Currently, Data Prepper is still kicked off but fails with an ugly stack-trace. This is because the Dockerfile simply defines two commands in sequence.
  • There is scope to simplify the input mechanism for the unified Docker image. Today, it expects a Data Prepper YAML file, but this would require an understanding of the DP input format. Ideally, the input to the tool should be far simpler - possibly user-interactive and smartly deriving expected inputs based on endpoints.
  • Today, if an index is present on both source and target clusters but differs in settings/mappings, it is classified as a “conflict” and its data is not migrated. However, it may be safe to migrate indices that differ only in settings and not in mappings, since settings differences are unlikely to change indexing behavior. We would still need to first incorporate a doc_count check, though (as with identical indices above).
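
The settings-vs-mappings distinction and the doc_count guard from the last bullet could be sketched as below. The function name and the metadata shape (`mappings`/`settings`/`doc_count` dicts) are hypothetical, not ICT's actual data model:

```python
def classify_index(source_meta: dict, target_meta: dict) -> str:
    """Classify an index that exists on both source and target clusters.

    Each argument is a dict with hypothetical keys:
    'mappings', 'settings', and 'doc_count'.
    """
    if source_meta["mappings"] != target_meta["mappings"]:
        # Mappings differ: documents may be interpreted differently,
        # so migrating data is unsafe
        return "conflict"
    if source_meta["settings"] != target_meta["settings"]:
        # Settings-only differences are unlikely to change behavior,
        # but guard against overwriting existing documents
        if target_meta["doc_count"] > 0:
            return "conflict"
        return "migrate"
    # Fully identical definition: ICT errs on the side of caution and
    # skips it, since the target may already hold the documents
    return "identical"
```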

Issues Resolved

Relates to #165 and #163

Testing

$ python -m coverage run -m unittest
.........................Wrote output YAML pipeline to: dummy
....
----------------------------------------------------------------------
Ran 29 tests in 0.043s

OK
$ python -m coverage report --omit "*/tests/*"
Name                  Stmts   Miss  Cover
-----------------------------------------
index_operations.py      37      2    95%
main.py                 112      0   100%
utils.py                 13      0   100%
-----------------------------------------
TOTAL                   162      2    99%

Check List

  • New functionality includes testing
    • All tests pass, including unit test, integration test and doctest
  • New functionality has been documented
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

# Copy only source code
COPY ./*.py .

# update PATH
ENV PATH=/root/.local:$PATH

# make sure you include the -u flag to have our stdout logged
-ENTRYPOINT [ "python", "-u", "./main.py" ]
+CMD python -u ./main.py -r $DATA_PREPPER_PATH/pipelines/pipelines.yaml $DATA_PREPPER_PATH/pipelines/pipelines.yaml; $DATA_PREPPER_PATH/bin/data-prepper
Collaborator

This is more of a me-and-docker question, but why switch from ENTRYPOINT to CMD?

Collaborator

Unrelated on this line:

If there are no indices to migrate, the Docker container should ideally just shut down without running Data Prepper. Today, Data Prepper would still run and incorrectly migrate all indices. This is because the Dockerfile simply defines two commands in sequence.

This seems to be largely because we're reading from and writing to the same YAML file, and then running with that YAML file. Would the situation be "safer" if we e.g. renamed pipelines.yaml to original-pipelines.yaml, used that as the source for ./main.py but output to pipelines.yaml and then ran data-prepper? Would that ensure that only the correct indices get migrated?

Incorrectly migrating indices (with no step for the user to manually intervene) seems like a fairly large bug to continue with.

Member Author

This is more of a me-and-docker question, but why switch from ENTRYPOINT to CMD?

Good question! I was using CMD because the base Data Prepper Dockerfile uses CMD, but it turns out ENTRYPOINT in the child image overrides the parent CMD. Updated.

This seems to be largely because we're reading from and writing to the same YAML file, and then running with that YAML file. Would the situation be "safer" if we e.g. renamed pipelines.yaml to original-pipelines.yaml, used that as the source for ./main.py but output to pipelines.yaml and then ran data-prepper? Would that ensure that only the correct indices get migrated?

Agree that this is a reasonable mitigation. I've changed the path for the input YAML file for ICT to not live in the data-prepper pipelines directory. While this mitigates this bug, the Docker image still runs the commands in sequence, which causes Data Prepper to start but fail because no pipeline configuration is found. I'm tracking a fix for this in https://opensearch.atlassian.net/browse/MIGRATIONS-1218
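
The mitigation could be sketched as the Dockerfile fragment below. The input path is illustrative (the PR's actual path is not shown here), and the `&&` chaining is a further hypothetical hardening - it would also skip Data Prepper if the ICT step fails, which the PR instead tracks separately in MIGRATIONS-1218:

```dockerfile
# Read the ICT input from outside the Data Prepper pipelines directory,
# write the generated pipeline into it, then start Data Prepper.
CMD python -u ./main.py -r /code/input/pipelines.yaml \
        $DATA_PREPPER_PATH/pipelines/pipelines.yaml \
    && $DATA_PREPPER_PATH/bin/data-prepper
```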

… tool and Data Prepper

This change updates the Dockerfile definition for index_configuration_tool to use Data Prepper as the base image. This will allow us to run the index_configuration_tool Python logic to configure indices on the target cluster before kicking off Data Prepper execution to move data over. Note that since these steps must occur in sequence, the child Docker image overrides the parent's CMD definition and must manually invoke the data-prepper executable.

Signed-off-by: Kartik Ganesh <[email protected]>

codecov bot commented Jul 19, 2023

Codecov Report

Merging #234 (c676432) into main (0e466cb) will not change coverage.
The diff coverage is n/a.

@@            Coverage Diff            @@
##               main     #234   +/-   ##
=========================================
  Coverage     54.69%   54.69%           
  Complexity      490      490           
=========================================
  Files            65       65           
  Lines          2565     2565           
  Branches        235      235           
=========================================
  Hits           1403     1403           
  Misses         1035     1035           
  Partials        127      127           
Flag        Coverage Δ
unittests   54.69% <ø> (ø)

Flags with carried forward coverage won't be shown.

Signed-off-by: Kartik Ganesh <[email protected]>
@kartg kartg merged commit a7b7bcd into opensearch-project:main Jul 19, 2023
5 checks passed
@kartg kartg deleted the ict-single-Dockerfile branch July 19, 2023 20:44