Documentation revamp + Fetch Migration (#356)
This PR cleans up stale README content and adds instructions for Fetch Migration to the documentation. It also adds a detailed, step-by-step guide on deploying to AWS using CDK and Copilot.

Some other minor updates included in this commit:

* Since Data Prepper 2.5.0, which includes the OpenSearch source plugin, has now been released, the Fetch Migration Dockerfile uses it as the base image rather than a snapshot
* The Python dependency versions for Fetch Migration have been updated
* Updated the runTestBenchmarks.sh script to accept a no-ssl flag (see the example sketch after this list)
* Increased the default CPU and memory specification for the Fetch Migration container
* Minor bugfixes in fetch_orchestrator.py:
- Fixed an incorrect argparse name for the Data Prepper endpoint
- Updated the INLINE_PIPELINE environment variable lookup to avoid a KeyError
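A minimal sketch of how the new flag might be used (the script path and the pairing with `--no-auth` are assumptions; the script's other flags, such as one for overriding the endpoint, are not shown in this diff):

```shell
# Run the benchmark workloads without TLS and without basic auth (sketch)
./runTestBenchmarks.sh --no-ssl --no-auth
```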

---------

Signed-off-by: Kartik Ganesh <[email protected]>
kartg authored Oct 20, 2023
1 parent 1282cce commit 2ec00c0
Showing 11 changed files with 341 additions and 323 deletions.
12 changes: 5 additions & 7 deletions FetchMigration/Dockerfile
@@ -1,12 +1,10 @@
# TODO Move away from snapshot version after OS source is released
# https://github.com/opensearch-project/data-prepper/issues/1985
FROM opensearch-data-prepper:2.5.0-SNAPSHOT
FROM opensearchproject/data-prepper:2.5.0
COPY python/requirements.txt .

# Install dependencies to local user directory
RUN apk update
RUN apk add --no-cache python3 py-pip
RUN pip install --user -r requirements.txt
RUN apt -y update
RUN apt -y install python3 python3-pip
RUN pip3 install --user -r requirements.txt

ENV FM_CODE_PATH /code
WORKDIR $FM_CODE_PATH
@@ -21,4 +19,4 @@ RUN echo "ssl: false" > $DATA_PREPPER_PATH/config/data-prepper-config.yaml
RUN echo "metricRegistries: [Prometheus]" >> $DATA_PREPPER_PATH/config/data-prepper-config.yaml

# Include the -u flag to have stdout logged
ENTRYPOINT python -u ./fetch_orchestrator.py $DATA_PREPPER_PATH $FM_CODE_PATH/input.yaml http://localhost:4900
ENTRYPOINT python3 -u ./fetch_orchestrator.py $DATA_PREPPER_PATH $FM_CODE_PATH/input.yaml http://localhost:4900
91 changes: 45 additions & 46 deletions FetchMigration/README.md
@@ -1,43 +1,38 @@
# Index Configuration Tool
# "Fetch" Data Migration / Backfill

Python package that automates the creation of indices on a target cluster based on the contents of a source cluster.
Index settings and index mappings are correctly copied over, but no data is transferred.
This tool seeks to eliminate the need to [specify index templates](https://github.com/awslabs/logstash-output-amazon_es#optional-parameters) when migrating data from one cluster to another.
The tool currently supports ElasticSearch or OpenSearch as source and target.
Fetch Migration provides an easy-to-use tool that simplifies the process of moving indices and their data from a
"source" cluster (either Elasticsearch or OpenSearch) to a "target" OpenSearch cluster. It automates the process of
comparing indices between the two clusters and only creates index metadata (settings and mappings) for indices that do not already
exist on the target cluster. Internally, the tool uses [Data Prepper](https://github.com/opensearch-project/data-prepper)
to migrate data for these created indices.

## Parameters
The Fetch Migration tool is implemented in Python.
A Docker image can be built using the included [Dockerfile](./Dockerfile).

The first required input to the tool is a path to a [Data Prepper](https://github.com/opensearch-project/data-prepper) pipeline YAML file, which is parsed to obtain the source and target cluster endpoints.
The second required input is an output path to which a modified version of the pipeline YAML file is written.
This version of the pipeline adds an index inclusion configuration to the sink, specifying only those indices that were created by the index configuration tool.
The tool also supports several optional flags:
## Components

| Flag | Purpose |
| ------------- | ------------- |
| `-h, --help` | Prints help text and exits |
| `--report, -r` | Prints a report of indices indicating which ones will be created, along with indices that are identical or have conflicting settings/mappings. |
| `--dryrun` | Skips the actual creation of indices on the target cluster |
The tool consists of 3 components:
* A "metadata migration" module that handles metadata comparison between the source and target clusters.
This can output a human-readable report as well as a Data Prepper pipeline `yaml` file.
* A "migration monitor" module that monitors the progress of the migration and shuts down the Data Prepper pipeline
once the target document count has been reached.
* An "orchestrator" module that sequences these steps as a workflow and manages the kick-off of the Data Prepper
process between them.

### Reporting

If `--report` is specified, the tool prints all processed indices organized into 3 buckets:
* Successfully created on the target cluster
* Skipped due to a conflict in settings/mappings
* Skipped since the index configuration is identical on source and target
The orchestrator module is the Docker entrypoint for the tool, though each component can be executed separately
via Python. Help text for each module can be printed by supplying the `-h / --help` flag.
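As a sketch, help text for each module might be printed like this, assuming the module filenames carry over from the old `index_configuration_tool/` layout; `fetch_orchestrator.py` and `metadata_migration.py` appear elsewhere in this repository, while the monitor module's filename is an assumption:

```shell
# Print help text for each module (the monitor filename is assumed)
python python/fetch_orchestrator.py --help
python python/metadata_migration.py --help
python python/migration_monitor.py --help
```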

## Current Limitations

* Only supports ElasticSearch and OpenSearch endpoints for source and target
* Only supports basic auth
* Type mappings for legacy indices are not handled
* Index templates and index aliases are not copied
* Index health is not validated after creation

## Usage
* Fetch Migration runs as a single instance and does not support vertical scaling or data slicing
* The tool does not support customizing the list of indices included for migration
* Metadata migration only supports basic auth
* The migration does not filter out `red` indices
* In the event that the migration fails or the process dies, the created indices on the target cluster are not rolled back

### Command-Line
## Execution

#### Setup:
### Python

* [Clone](https://docs.github.com/en/repositories/creating-and-managing-repositories/cloning-a-repository) this GitHub repo
* Install [Python](https://www.python.org/)
@@ -47,15 +42,13 @@ If `--report` is specified, the tool prints all processed indices organized into
Navigate to the cloned GitHub repo. Then, install the required Python dependencies by running:

```shell
python -m pip install -r index_configuration_tool/requirements.txt
python -m pip install -r python/requirements.txt
```

#### Execution:

After [setup](#setup), the tool can be executed using:
The Fetch Migration workflow can then be kicked off via the orchestrator module:

```shell
python index_configuration_tool/metadata_migration.py <pipeline_yaml_path> <output_file>
python python/fetch_orchestrator.py --help
```
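Beyond printing help, a full run supplies the Data Prepper installation path, the pipeline `yaml` file, and the Data Prepper endpoint, mirroring the Docker ENTRYPOINT; the paths below are placeholders, not defaults:

```shell
# Sketch of a full orchestrator run (argument order mirrors the Docker ENTRYPOINT)
python python/fetch_orchestrator.py /usr/share/data-prepper ./pipeline.yaml http://localhost:4900
```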

### Docker
@@ -67,42 +60,48 @@ docker build -t fetch-migration .
```

Then run the `fetch-migration` image.
Replace `<pipeline_yaml_path>` in the command below with the path to your Logstash config file:
Replace `<pipeline_yaml_path>` in the command below with the path to your Data Prepper pipeline `yaml` file:

```shell
docker run -p 4900:4900 -v <pipeline_yaml_path>:/code/input.yaml ict
docker run -p 4900:4900 -v <pipeline_yaml_path>:/code/input.yaml fetch-migration
```
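Alternatively, the orchestrator also reads an `INLINE_PIPELINE` environment variable, which it base64-decodes and writes to the pipeline file path before starting. Passing the pipeline this way instead of using a volume mount is a sketch based on that behavior:

```shell
# Sketch: pass the pipeline inline as a base64-encoded environment variable
docker run -p 4900:4900 \
  -e INLINE_PIPELINE="$(base64 < "<pipeline_yaml_path>" | tr -d '\n')" \
  fetch-migration
```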

### AWS deployment

Refer to [AWS Deployment](../deployment/README.md) to deploy this solution to AWS.

## Development

The source code for the tool is located under the `index_configuration_tool/` directory. Please refer to the [Setup](#setup) section to ensure that the necessary dependencies are installed prior to development.
The source code for the tool is located under the `python/` directory, with unit tests in the `tests/` subdirectory.
Please refer to the [Setup](#setup) section to ensure that the necessary dependencies are installed prior to development.

Additionally, you'll need to install the development dependencies by running:

```shell
python -m pip install -r index_configuration_tool/dev-requirements.txt
python -m pip install -r python/dev-requirements.txt
```

### Unit Tests

Unit tests are located in a sub-directory named `tests`. Unit tests can be run using:
Unit tests can be run from the `python/` directory using:

```shell
python -m unittest
python -m coverage run -m unittest
```

### Coverage

Code coverage metrics can be generated by first running unit tests using _coverage run_:
_Code coverage_ metrics can be generated after a unit-test run. A report can either be printed on the command line:

```shell
python -m coverage run -m unittest
python -m coverage report --omit "*/tests/*"
```

Then a report can either be printed on the command line or generated as HTML.
Note that the `--omit` parameter must be specified to avoid tracking code coverage on unit test code:
or generated as HTML:

```shell
python -m coverage report --omit "*/tests/*"
python -m coverage html --omit "*/tests/*"
```
```

Note that the `--omit` parameter must be specified to avoid tracking code coverage on unit test code itself.
3 changes: 2 additions & 1 deletion FetchMigration/python/dev-requirements.txt
@@ -1 +1,2 @@
coverage>=7.2.3
coverage>=7.3.2
pur>=7.3.1
7 changes: 4 additions & 3 deletions FetchMigration/python/fetch_orchestrator.py
@@ -59,8 +59,9 @@ def run(dp_base_path: str, dp_config_file: str, dp_endpoint: str):
cli_args = arg_parser.parse_args()
base_path = os.path.expandvars(cli_args.data_prepper_path)

if os.environ["INLINE_PIPELINE"] is not None:
decoded_bytes = base64.b64decode(os.environ["INLINE_PIPELINE"])
inline_pipeline = os.environ.get("INLINE_PIPELINE", None)
if inline_pipeline is not None:
decoded_bytes = base64.b64decode(inline_pipeline)
with open(cli_args.config_file_path, 'wb') as config_file:
config_file.write(decoded_bytes)
run(base_path, cli_args.config_file_path, cli_args.dp_endpoint)
run(base_path, cli_args.config_file_path, cli_args.data_prepper_endpoint)
4 changes: 2 additions & 2 deletions FetchMigration/python/requirements.txt
@@ -1,5 +1,5 @@
jsondiff>=2.0.0
prometheus-client>=0.17.1
pyyaml>=6.0
pyyaml>=6.0.1
requests>=2.31.0
responses>=0.23.1
responses>=0.23.3
8 changes: 7 additions & 1 deletion README.md
@@ -32,7 +32,7 @@ A containerized end-to-end solution can be deployed locally using the

### AWS deployment

Refer to [AWS Deployment](deployment/copilot/README.md) to deploy this solution to AWS.
Refer to [AWS Deployment](deployment/README.md) to deploy this solution to AWS.

## Developer contributions

@@ -46,6 +46,12 @@ The TrafficCapture directory hosts a set of projects designed to facilitate the

More documentation on this directory including the projects within it can be found here: [Traffic Capture](TrafficCapture/README.md).

### Fetch Migration

The FetchMigration directory hosts tools that simplify the process of backfilling / moving data from one cluster to another.

Further documentation can be found here: [Fetch Migration README](FetchMigration/README.md).

### Running Tests

Developers can run a test script which will verify the end-to-end Local Docker Solution.
@@ -5,6 +5,7 @@ endpoint="https://capture-proxy-es:9200"
auth_user="admin"
auth_pass="admin"
no_auth=false
no_ssl=false

# Override default values with optional command-line arguments
while [[ $# -gt 0 ]]; do
@@ -29,6 +30,10 @@ while [[ $# -gt 0 ]]; do
no_auth=true
shift
;;
--no-ssl)
no_ssl=true
shift
;;
*)
shift
;;
@@ -42,8 +47,13 @@ else
auth_string=",basic_auth_user:${auth_user},basic_auth_password:${auth_pass}"
fi

if [ "$no_ssl" = true ]; then
base_options_string=""
else
base_options_string="use_ssl:true,verify_certs:false"
fi

# Construct the final client options string
base_options_string="use_ssl:true,verify_certs:false"
client_options="${base_options_string}${auth_string}"

echo "Running opensearch-benchmark workloads against ${endpoint}"