
Optimizing File Localization to Avoid Excess Downloads #671

Open
superbsky opened this issue Jun 5, 2023 · 10 comments
Labels
enhancement New feature or request tobegroomed Add this label while creating new issues to get issues prioritized on the backlog

Comments

@superbsky

Problem:
I am exploring options to use a local file path on the storage account for task execution without the need to localize the input files. I attempted to place the input files into the /cromwell-executions path, which is mounted to the task VM. During execution, I noticed that the task uses a path within /cromwell-executions, but the download script still downloads all my input files for the task.

Solution:
Upon checking BatchScheduler.cs, it appears that it collects all input files, including additionalInputFiles, for downloading, even when the local path is available.

Describe alternatives you've considered
Please advise if it is possible to use the "streamable" or "localization_optional" flags for the input files to avoid excessive file downloading. I have seen discussions in the TES repository but I'm unsure if CoA currently supports these flags.
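For context, the Cromwell-side flag is declared per input in a task's `parameter_meta` block. A minimal WDL sketch (task and input names are hypothetical; whether the flag is honored depends on the backend in use):

```wdl
version 1.0

task count_reads {
  input {
    File input_bam   # hypothetical input file
  }

  # On backends that honor it, this tells Cromwell the task will
  # read the file from its URL rather than requiring localization.
  parameter_meta {
    input_bam: { localization_optional: true }
  }

  command <<<
    samtools view -c ~{input_bam}
  >>>

  output {
    Int n = read_int(stdout())
  }
}
```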

Additional context
In general, the goal is to utilize an Azure Storage account as a mount for the input files and exclude unnecessary file localization. I noticed that Cromwell recently added support for the Blobs filesystem, but I am uncertain if it would help resolve this issue.

@BMurri
Collaborator

BMurri commented Jun 5, 2023

Problem/Solution
Our TES does not run any tasks on the same machine as Cromwell is running, so the mounts available to Cromwell are not available to the tasks.

Our TES implementation does not mount entire storage containers (considered a security risk when TES is used for shared groups, one of our primary use cases), nor subpaths of containers (currently would require installing drivers on every compute node), on the compute nodes running the tasks. Further, the TES spec doesn't seem to have the concept of an execution directory existing beyond task completion (for CoA, that concept comes from Cromwell), so supporting mounted storage has to be a configurable opt-in (probably in the deployment configuration).

Alternatives
The download list collected by BatchScheduler.cs uses the path inside the executor docker container as its definition of "local path", so the download would be required regardless (without implementing mounting). Cromwell currently doesn't provide localization_optional to any backend other than GCE. If Cromwell were to support localization_optional on TES, it would do so by setting streamable, and we would have to implement support for that by skipping the download (indicating that the task knows how to, and will, access the content from the URL itself, which it would need to know, as described here).
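For reference, in TES 1.1 the hint lands on the individual task input. A sketch of the relevant fragment of a task request (URL and paths are hypothetical):

```json
{
  "inputs": [
    {
      "url": "https://account.blob.core.windows.net/container/sample.bam",
      "path": "/cromwell-executions/inputs/sample.bam",
      "type": "FILE",
      "streamable": true
    }
  ]
}
```

With `streamable` set, a TES implementation may skip the download and leave the executor to read the content from the URL directly.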

@BMurri BMurri added needs discussion Team discussion is needed and removed needs discussion Team discussion is needed labels Jun 5, 2023
@superbsky
Author

Thank you for the clarification!

I noticed that the dockerRoot value is set in the cromwell-application.conf and assumed that it is actually mounted to the task VM.

backend {
  default = "TES"
  providers {
    TES {
      actor-factory = "cromwell.backend.impl.tes.TesBackendLifecycleActorFactory"
      config {
        filesystems {
          http { }
        }
        root = "/cromwell-executions"
        dockerRoot = "/cromwell-executions"
        ...

Also, I can see that the logs in my task during execution are using the /cromwell-executions directory where I placed my input files.

2023-06-04 18:25:24,576 INFO - TesAsyncBackendJobExecutionActor [UUID(29b94176)ExomeGermlineSingleSample.PairedFastQsToUnmappedBAM:NA:1]: `/gatk/gatk --java-options "-Xmx19g"
FastqToSam
--FASTQ /cromwell-executions/fastq/R1_001.fastq.gz
--FASTQ2 /cromwell-executions/fastq/R2_001.fastq.gz
...

These files were copied using the URL to the input directory, even though I assumed it was already mounted there.

total_bytes=0 && echo $(date +%T) && path='/cromwell-executions/fastq/R2_001.fastq.gz' && url='https://coa.blob.core.windows.net/cromwell-executions/fastq/R2_001.fastq.gz?sv=SAS' && blobxfer download --storage-url "$url" --local-path "$path" --chunk-size-bytes 104857600 --rename --include 'fastq/R2_001.fastq.gz'
...

So, what would be the best approach for me to reduce the number of files being copied from the Storage Account to the VM executing tasks? The only solution I can think of is to combine the execution of tasks that use the same or similar inputs/outputs. However, this approach is labor-intensive and prone to errors.

@BMurri
Collaborator

BMurri commented Jun 5, 2023

That container is currently mounted to the VM Cromwell is running in, so Cromwell sees it as a local file system. This means that if you pre-stage your inputs as you describe, everything Cromwell accesses directly will not require uploading/downloading until the entire workflow is complete and you are collecting your results. However, tasks that run through the backend (instead of inside Cromwell itself) still involve downloading/uploading today, which is why intermediate files created during the tasks aren't found in the /cromwell-executions container.

We can certainly look into mounting (something I've personally wanted to test to see how it would affect overall costs in terms of both time and spend) but I'm not certain when we could start on it.

@superbsky
Author

Thanks again. I'm looking forward to hearing more about this because the downloading/uploading process is taking almost the same amount of time as the actual computation itself.

@BMurri
Collaborator

BMurri commented Jun 5, 2023

I agree that combining tasks would not be the best idea. The VMs we use as compute nodes do appear to have blobfuse installed, but the tasks run in a container that is not given access to any of the FUSE drivers, so at this point you're stuck with the file transfers.

You might try selecting vm_sizes that have faster networking for your tasks.
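For illustration, TES 1.1 allows a client to hint the VM family through `resources.backend_parameters`; a sketch of what such a request fragment might look like (the `vm_size` key and the SKU shown are assumptions about this deployment, not confirmed here):

```json
{
  "resources": {
    "cpu_cores": 4,
    "ram_gb": 28,
    "backend_parameters": {
      "vm_size": "Standard_D4ds_v5"
    }
  }
}
```

Larger SKUs generally come with higher expected network bandwidth, which shortens the transfer phase even when the downloads themselves can't be avoided.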

@superbsky
Author

Is it possible to specify the mount configuration for the pools in the config file, specifically in src/deploy-cromwell-on-azure/scripts/env-04-settings.txt? This would save a lot of trouble. Additionally, the mount could be added to the Docker job image to further streamline the process, or simply mounted at $AZ_BATCH_TASK_WORKING_DIR/wd.

@BMurri
Collaborator

BMurri commented Jun 16, 2023

Right now there's no provision for the compute nodes to mount anything, unless that is done inside of the tasks themselves. I'm looking at what a solution might look like, but yes, some combination of configuration and/or task backend parameters will be involved.

We recently added the TES 1.1 "streamable" flag (which will implement Cromwell's "localization_optional" flag), but Cromwell hasn't implemented support for it on its TES backend, so it still won't prevent the downloads of your inputs. Ultimately, you will be waiting for Cromwell to add that support to its TES backend implementation. All we can do here is facilitate mounting your specified container.

@BMurri
Collaborator

BMurri commented Feb 29, 2024

Note that Azure (who supplies the blob fuse filesystem driver that Cromwell currently uses) recommends NOT sharing that blob container (not the file system per se, but the entire blob container) with any other agent (such as the tasks that run through CoA/TES) that makes changes to those blobs. I would recommend waiting for #694 (or something similar) to be implemented before moving forward with any effort to localize file systems on the compute nodes.

@BMurri BMurri added enhancement New feature or request tobegroomed Add this label while creating new issues to get issues prioritized on the backlog labels Nov 11, 2024
@BMurri
Collaborator

BMurri commented Nov 11, 2024

TL;DR - If implemented, this would optionally add filesystem mounts using NFS or similar to Cromwell and the Azure Batch compute nodes. It remains to be determined whether this can scale to accommodate possibly thousands of simultaneous mounts.
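For illustration, Azure Batch pools already expose a mount hook that an implementation along these lines might use. A sketch of the REST-level pool fragment (server address, export, and mount options are hypothetical):

```json
{
  "mountConfiguration": [
    {
      "nfsMountConfiguration": {
        "source": "10.0.0.4:/exports/cromwell",
        "relativeMountPath": "cromwell-executions",
        "mountOptions": "-o vers=4,minorversion=1,sec=sys"
      }
    }
  ]
}
```

The scaling question above remains: each compute node performs its own mount, so a large pool translates into many simultaneous NFS client connections against one endpoint.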

@BMurri
Collaborator

BMurri commented Nov 11, 2024

Duplicate of #213

@BMurri BMurri marked this as a duplicate of #213 Nov 11, 2024