Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Demonstrate use of ECR within an Airflow DAG #186

Open
LucaCinquini opened this issue Aug 14, 2024 · 23 comments
Open

Demonstrate use of ECR within an Airflow DAG #186

LucaCinquini opened this issue Aug 14, 2024 · 23 comments
Assignees
Labels

Comments

@LucaCinquini
Copy link
Collaborator

Using Docker images pulled from ECR would enable higher scalability when executing workload. We want to demonstrate how this is possible using a simple CWL workflow.

@LucaCinquini
Copy link
Collaborator Author

Suggested steps:
o Manually create ECR on unity-venue-dev

@nikki-t
Copy link
Collaborator

nikki-t commented Aug 14, 2024

It looks like this might be a good starting place for using containers with CWL: https://www.commonwl.org/user_guide/topics/using-containers.html

@LucaCinquini LucaCinquini moved this to In Progress in Unity Project Board Aug 14, 2024
@LucaCinquini
Copy link
Collaborator Author

For security reasons, we should store the ECR endpoint (which contains the account number) as an Airflow variable, then within the DAG retrieve it and pass it to echo_message.cwl as an additional parameter.

@mike-gangl
Copy link

Would it be possible to use an existing application package? I'd rather test with something real instead of something like hello world.

I can supply the existing docker containers and CWL for our example application, all we'd need to do is update the pointers to docker containers in the CWL to the ECR ones.

@LucaCinquini
Copy link
Collaborator Author

Let's do both... Simple "echo" first, application package after.

@nikki-t
Copy link
Collaborator

nikki-t commented Aug 19, 2024

I was able to implement this part way but the CWL does not allow input parameters in the dockerPull field. See this documentation for a list.

I am hoping that I can override the workflow requirements at load time but am not quite sure this will work.

@LucaCinquini
Copy link
Collaborator Author

@nikki-t : using override might work - but we would also have to change the script that runs the CWL workklow. Another possibility would be to update the setup task to:

  • Retrieve the URL of the YAML file
  • Replace the "budybox" docker image with the full ECR URL
  • Convert the YAML to JSON
  • Pass the JSON string as the second argument to the Docker container:
  • arguments=["{{ params.cwl_workflow }}", "{{ params.cwl_args }}"] : params_cwl_args can be either a URL, or a JSON string)

@nikki-t
Copy link
Collaborator

nikki-t commented Aug 19, 2024

@LucaCinquini - I was able to get the override to work without needing to change the script that runs the CWL workflow by converting the CWL argument YAML into JSON and including the override there.

@nikki-t
Copy link
Collaborator

nikki-t commented Aug 19, 2024

The CWL task cannot pull the private container image from ECR. The CWL task launches a pod and runs the SPS Docker CWL image. This gets a service account of airflow-worker which has an IAM role (AirflowWorkerPolicy) associated with it. This role has permissions to access ECR:

            "ecr:GetAuthorizationToken",
            "ecr:BatchCheckLayerAvailability",
            "ecr:GetDownloadUrlForLayer",
            "ecr:BatchGetImage",

But I get the following error:

[2024-08-19, 18:30:56 UTC] {pod_manager.py:468} INFO - [base] INFO /usr/share/cwl/venv/bin/cwl-runner 3.1.20240508115724 
[2024-08-19, 18:30:56 UTC] {pod_manager.py:468} INFO - [base] Error: No such object: xxx.dkr.ecr.us-west-2.amazonaws.com/unity-nikki-1-dev-sps-busybox 
[2024-08-19, 18:30:56 UTC] {pod_manager.py:468} INFO - [base] INFO ['docker', 'pull', xxx.dkr.ecr.us-west-2.amazonaws.com/unity-nikki-1-dev-sps-busybox']
[2024-08-19, 18:30:56 UTC] {pod_manager.py:468} INFO - [base] Using default tag: latest 
[2024-08-19, 18:30:56 UTC] {pod_manager.py:468} INFO - [base] Error response from daemon: Head "https://xxx.dkr.ecr.us-west-2.amazonaws.com/v2/unity-nikki-1-dev-sps-busybox/manifests/latest": no basic auth credentials 
[2024-08-19, 18:30:56 UTC] {pod_manager.py:468} INFO - [base] ERROR Workflow or tool uses unsupported feature: 
[2024-08-19, 18:30:57 UTC] {pod_manager.py:486} INFO - [base] Docker is required to run this tool: Command '['docker', 'pull', xxx.dkr.ecr.us-west-2.amazonaws.com/unity-nikki-1-dev-sps-busybox']' returned non-zero exit status 1.

I think that the task may need to log into the ECR repo but there isn't a clear way to have this occur prior to the docker pull. Perhaps we can log in, in the entrypoint script?

It looks like the CWL DockerRequirement field can build a docker image using dockerFile: https://www.commonwl.org/v1.2/CommandLineTool.html#DockerRequirement

@nikki-t
Copy link
Collaborator

nikki-t commented Aug 26, 2024

After reading documentation on IAM roles for service accounts and on the Amazon EKS Pod Identity Agent, it looks like the following steps are needed to connect service accounts to IAM roles and use the EKS Pod Identity Agent:

Steps needed for to connect a service account with an IAM role:

  1. Create an IAM OIDC provider for cluster.
  2. Set up Amazon EKS Pod Identity Agent.
    a. Node role has permissions to do the AssumeRoleForPodIdentity action in EKS Auth API.
  3. Assign an IAM role to Kubernetes service account.
    a. Role must have trust relationship.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::$account_id:oidc-provider/$oidc_provider"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "$oidc_provider:aud": "sts.amazonaws.com",
          "$oidc_provider:sub": "system:serviceaccount:$namespace:$service_account"
        }
      }
    }
  ]
}
  1. Annotate service account with IAM role.
  2. Create EKS pod identity association with an IAM role that allows service account to assume IAM role with EKS pod identity.
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowEksAuthToAssumeRoleForPodIdentity",
            "Effect": "Allow",
            "Principal": {
                "Service": "pods.eks.amazonaws.com"
            },
            "Action": [
                "sts:AssumeRole",
                "sts:TagSession"
            ]
        }
    ]
}
  1. Configure pods to access AWS services with service account.
    a. Define serviceAccountName in spec (YAML).

After putting all of these together, I am still running into theno basic auth credentials error. I am going to pivot to trying to log into docker in the entrypoint script.

@LucaCinquini
Copy link
Collaborator Author

I agree... let's see if the other solution works, thanks.

@nikki-t
Copy link
Collaborator

nikki-t commented Aug 27, 2024

I was able to use an ECR URI in the execution of a CWL DAG. I just need to finalize how we store the ECR URI. It is currently stored as an Airflow Variable with a key of cwl_dag_ecr_uri which is what the cwl_dag_ecr.py script looks for in order to pass the URI to the CWL DAG. The end user will need to create a variable with this name in and assign it a value with the URI of their ECR container image in the Airflow UI. We could also pass it in as an input parameter to the DAG but then the user would need to enter that every time.

@LucaCinquini - What do you think is best?

@LucaCinquini
Copy link
Collaborator Author

@nikki-t : what about if we store the AWS ECR base name ([XXXXXXXXXX.dkr.ecr.us-west-2.amazonaws.com]) as an Airflow variable, so it can be used by multiple DAGs. Then the specific "cwl_dag_ecr.py" will combine that with the name and version of the CWL Docker image. Other DAGs might use ECR to execute different Docker images.

@nikki-t
Copy link
Collaborator

nikki-t commented Aug 27, 2024

@LucaCinquini - I think we would still need input parameters or Airflow variables for the name and version of the ECR container image. The full URI is:

<aws_account_id>.dkr.ecr.<aws_region>.amazonaws.com/<container_image_name>:<version>.

So the user would have to provide the <container_image_name> and <version> since that might be different for different executions of the DAG. But we can certainly store the AWS ECR base name as an Airflow variable which we might be able to define in the Helm chart for Airflow.

@LucaCinquini
Copy link
Collaborator Author

Sorry, I didn't express myself correctly. The "cwl_dag_ecr.py" should specify the name of this Docker image: ghcr.io/unity-sds/unity-sps/sps-docker-cwl-ecr:latest, which it already does. The name and tag of the algorithm image should be part of the specific CWL workflow, then pre-pended with the ECR base URI.
I think we should investigate what is CWL support for using Docker images from a private repository.

@nikki-t
Copy link
Collaborator

nikki-t commented Aug 27, 2024

Here is the documentation on using containers with CWL. CWL does not allow parameter references in the dockerPull key so the ECR URI needs to be built and passed to the CWL DAG definition in the cwl_dag_ecr.py script.

So I think it makes the most sense to have the user specify the ECR URI via Airflow variable or input parameter. And to keep things as simple as possible, we can store the ECR base URI (which is also the login URL) and then the user can pass in their container image name and version as input parameters.

@LucaCinquini
Copy link
Collaborator Author

Realized that the current solution won't work for nested workflows, exploring different options.

@LucaCinquini
Copy link
Collaborator Author

Using "cwltool --pack" would create a single CWL document from the top level and nested workflows, which can then be parsed and modified as needed. Unfortunately, this does not seem to work with some of our workflows that are stored in Dockstore.

For example, this command works fine:

cwltool --pack https://raw.githubusercontent.com/common-workflow-language/cwltool/main/tests/wf/nested.cwl

But packing one of our Dockstore workflows gives an error:

cwltool --debug --pack https://raw.githubusercontent.com/unity-sds/sbg-workflows/main/L1-to-L2-e2e.cwl

ERROR I'm sorry, I couldn't load this CWL file.
The error was:
Traceback (most recent call last):
File "/Users/cinquini/PycharmProjects/unity-sps/.venv/lib/python3.9/site-packages/cwltool/main.py", line 1114, in main
print(print_pack(loadingContext, uri), file=stdout)
File "/Users/cinquini/PycharmProjects/unity-sps/.venv/lib/python3.9/site-packages/cwltool/main.py", line 633, in print_pack
packed = pack(loadingContext, uri)
File "/Users/cinquini/PycharmProjects/unity-sps/.venv/lib/python3.9/site-packages/cwltool/pack.py", line 265, in pack
v = rewrite_inputs[r]
KeyError: 'http://awslbdockstorestack-lb-1429770210.us-west-2.elb.amazonaws.com:9998/api/ga4gh/trs/v2/tools/%23workflow%2Fdockstore.org%2Fmike-gangl%2FSBG-unity-fractional-cover/versions/2/PLAIN-CWL/descriptor/%2FDockstore.cwl#sbg-unity-fractional-cover'

Probably because of the "#" in the URL. Can that be avoided?

@LucaCinquini
Copy link
Collaborator Author

@nikki-t
Copy link
Collaborator

nikki-t commented Sep 3, 2024

@LucaCinquini - I see that the above is a little more complicated than I initially thought. I have run the cwltool --pack --debug command and it does seem to get snagged at the # in the name of this step to run: http://awslbdockstorestack-lb-1429770210.us-west-2.elb.amazonaws.com:9998/api/ga4gh/trs/v2/tools/%23workflow%2Fdockstore.org%2Fmike-gangl%2FSBG-unity-fractional-cover/versions/2/PLAIN-CWL/descriptor/%2FDockstore.cwl

Looking at the CWL definition file that it is referencing, there is a # in the definition that gets packed: http://awslbdockstorestack-lb-1429770210.us-west-2.elb.amazonaws.com:9998/api/ga4gh/trs/v2/tools/%23workflow%2Fdockstore.org%2Fmike-gangl%2FSBG-unity-fractional-cover/versions/2/PLAIN-CWL/descriptor/%2FDockstore.cwl#sbg-unity-fractional-cover

I tried running the cwltool --pack --debug on my local system saving the above as local files and can reproduce the error. When I remove the # from the run key #sbg-unity-fractional-cover it then tries to pack a file that doesn't exist. So it seems the # is indicating another workflow and might be a relative path?

I also found a few open issues for the cwltool that may explain what we are running into:

@LucaCinquini
Copy link
Collaborator Author

Maybe we can ask @mike-gangl if that # can be avoided while writing CWL workflows and string them in Dockstore. Or maybe we store them in GitHub?

@nikki-t
Copy link
Collaborator

nikki-t commented Sep 3, 2024

@LucaCinquini - I think that makes sense. @mike-gangl did mention that long term there will be a docker registry owned by the system and this would remove account numbers from the CWL definition files.

I also did a test to see if the overrides would populate a nested workflow and I was able to run a successful (small) test using this workflow file: https://raw.githubusercontent.com/unity-sds/unity-sps-workflows/186-ecr-cwl-dag/demos/echo_message_ecr_nested.cwl

I can confirm that the following overrides gets passed to the nested workflow and the docker container is pulled and executed.

"cwltool:overrides": {
            "https://raw.githubusercontent.com/unity-sds/unity-sps-workflows/186-ecr-cwl-dag/demos/echo_message_ecr.cwl": {
                "requirements": {"DockerRequirement": {"dockerPull": ecr_uri}}
            }
        }

I did not update the Python script that executes the CWL DAGs, I think we should regroup on the requirements for this so that we can figure out a true test of the nested workflow overrides since it seems like there are nested, nested workflows.

It would also be good to understand how the end user will pass in the nested workflow and the best way to determine the list of nested workflows. Maybe we can also tag up on the context for these in the larger Unity SDS system?

@nikki-t
Copy link
Collaborator

nikki-t commented Sep 4, 2024

Documenting solutions tried and results.

dockerLoad command: https://www.commonwl.org/v1.0/CommandLineTool.html#DockerRequirement

  • Executes docker load which loads a container image from a tar archive, file, or STDIN. This doesn't apply to our use case.

environment variables: https://www.commonwl.org/user_guide/topics/environment-variables.html

  • Environment variables are passed into the task and cannot be used by the CWL definition.
  • Tested and got: **INFO** ['docker', 'pull', '$ENV_CONTAINER_REGISTRY'] in the logs. The environment variable ENV_CONTAINER_REGISTRY was not parsed.
  • There are no Docker CLI environment variables that can be used to indicate container registry: https://docs.docker.com/reference/cli/docker/#environment-variables
  • CWL tested:
#!/usr/bin/env cwl-runner
cwlVersion: v1.2
class: CommandLineTool
baseCommand: echo
arguments: [$(inputs.message)]
requirements:
  EnvVarRequirement:
    envDef:
      ENV_CONTAINER_REGISTRY: $(inputs.ecr_uri)

hints:
  DockerRequirement:
    dockerPull: $ENV_CONTAINER_REGISTRY

inputs:
  message:
    type: string
  ecr_uri:
    type: string

outputs:
  the_output:
    type: stdout
stdout: echo_message.txt

InlineJavascriptRequirement JavaScript to override values: https://www.commonwl.org/user_guide/topics/expressions.html

  • Can only be used where parameter references are allowed and this is not the case for dockerPull.

cwl:overrideswill override nested workflows: https://github.com/common-workflow-language/cwltool?tab=readme-ov-file#overriding-workflow-requirements-at-load-time

  • Requires associating container image URI with nested workflows. So this is possible but requires managing these associations somehow.

@LucaCinquini LucaCinquini moved this from In Progress to Done in Unity Project Board Sep 10, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Status: Done
Development

No branches or pull requests

3 participants