Optimizing File Localization to Avoid Excess Downloads #671
Regarding the problem/solution: our TES implementation does not mount entire storage containers on the compute nodes running the tasks (considered a security risk when TES is used by shared groups, one of our primary use cases), nor subpaths of containers (which would currently require installing drivers on every compute node). Further, the TES spec doesn't seem to have the concept of an execution directory that outlives task completion (for CoA, that concept comes from Cromwell), so supporting mounted storage has to be a configurable opt-in (probably in the deployment configuration).
Thank you for the clarification! I noticed that the `dockerRoot` value is set in `cromwell-application.conf` and assumed that it is actually mounted to the task VM:

```
backend {
```

Also, I can see that the logs in my task during execution are using the `/cromwell-executions` directory where I placed my input files:

```
2023-06-04 18:25:24,576 INFO - TesAsyncBackendJobExecutionActor [UUID(29b94176)ExomeGermlineSingleSample.PairedFastQsToUnmappedBAM:NA:1]: `/gatk/gatk --java-options "-Xmx19g"
```

These files were copied using the URL to the input directory, even though I assumed the directory was already mounted there:

```
total_bytes=0 && echo $(date +%T) && path='/cromwell-executions/fastq/R2_001.fastq.gz' && url='https://coa.blob.core.windows.net/cromwell-executions/fastq/R2_001.fastq.gz?sv=SAS' && blobxfer download --storage-url "$url" --local-path "$path" --chunk-size-bytes 104857600 --rename --include 'fastq/R2_001.fastq.gz'
```

So, what would be the best approach to reduce the number of files being copied from the storage account to the VM executing tasks? The only solution I can think of is to combine the execution of tasks that use the same or similar inputs/outputs, but that approach is labor-intensive and error-prone.
That container is currently mounted to the VM Cromwell is running in, so Cromwell sees it as a local file system. This means that if you pre-stage your inputs as you describe, everything Cromwell accesses directly requires no uploading/downloading until the entire workflow is complete and you are collecting your results. However, tasks that run through the backend (instead of inside Cromwell itself) will still involve downloading/uploading (which is why intermediate files created during the tasks aren't found in the mounted container). We can certainly look into mounting (something I've personally wanted to test to see how it would affect overall costs, in terms of both time and spend), but I'm not certain when we could start on it.
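For reference, a mount like the one on the Cromwell VM can be reproduced by hand with blobfuse (the driver mentioned later in this thread). A minimal sketch, assuming blobfuse v1; the storage account name, key, and mount paths below are placeholders:

```sh
# /etc/blobfuse/connection.cfg -- credentials for the container to mount
# (values are placeholders):
#   accountName mystorageaccount
#   accountKey  <storage-account-key>
#   containerName cromwell-executions

# Create the mount point and blobfuse's local cache directory, then mount.
sudo mkdir -p /mnt/cromwell-executions /mnt/blobfusetmp
sudo blobfuse /mnt/cromwell-executions \
  --tmp-path=/mnt/blobfusetmp \
  --config-file=/etc/blobfuse/connection.cfg \
  -o attr_timeout=240 -o entry_timeout=240 -o negative_timeout=120
```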
Thanks again. I'm looking forward to hearing more about this, because the downloading/uploading process is taking almost the same amount of time as the actual computation itself.
I agree that combining tasks would not be the best idea. The VMs we use for compute nodes do appear to have blobfuse installed, but the tasks run in a container that is not given access to any of the fuse drivers, so at this point you're stuck with the file transfers. You might try selecting vm_sizes with faster networking for your tasks.
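If it helps, here is a minimal sketch of what that might look like in a task's runtime section, assuming the deployment exposes a `vm_size` runtime attribute as the comment above suggests (the attribute name and the VM size below are assumptions; check the CoA docs for what your deployment actually supports):

```wdl
version 1.0

task align {
  input {
    File fastq
  }
  command <<<
    echo "processing ~{fastq}"
  >>>
  runtime {
    docker: "ubuntu:22.04"
    cpu: 8
    memory: "32 GB"
    # Hypothetical value: a size with higher network bandwidth
    vm_size: "Standard_D8ds_v4"
  }
}
```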
Is it possible to specify the mount configuration for the pools in the config file, specifically in src/deploy-cromwell-on-azure/scripts/env-04-settings.txt? This would save a lot of trouble. Additionally, we could add this mount to the Docker job image to further streamline the process, or just mount it to $AZ_BATCH_TASK_WORKING_DIR/wd.
Right now there's no provision for the compute nodes to mount anything, unless that is done inside the tasks themselves. I'm looking at what a solution might look like, but yes, some combination of configuration and/or task backend parameters will be involved. We recently added the TES 1.1 "streamable" flag (which will implement the Cromwell "localization_optional" flag), but Cromwell hasn't implemented support for that in its TES backend, so it still won't prevent the downloads of your inputs. Ultimately, you will be waiting for Cromwell to add support for it in its TES backend implementation; all we can do here is facilitate mounting your specified container.
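For context, this is what the Cromwell-side flag looks like in WDL today. A minimal sketch with made-up task and input names; note that, per the comment above, this is honored on some backends (e.g. Google's) but is not yet effective on the TES backend, so CoA will still download the input:

```wdl
version 1.0

task count_lines {
  input {
    File big_input
  }
  parameter_meta {
    # Tells Cromwell it may pass the remote URL instead of localizing the
    # file; the command must then be able to stream from that URL.
    big_input: { localization_optional: true }
  }
  command <<<
    wc -l ~{big_input}
  >>>
  runtime {
    docker: "ubuntu:22.04"
  }
}
```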
Note that Azure (which supplies the blobfuse filesystem driver that Cromwell currently uses) recommends NOT sharing that blob container (not the file system per se, but the entire blob container) with any other agent (such as the tasks that run through CoA/TES) that makes changes to those blobs. I would recommend waiting for #694 (or something similar) to be implemented before moving forward with any effort to localize file systems on the compute nodes.
TL;DR: if implemented, this would optionally add filesystem mounts (using NFS or similar) to Cromwell and the Azure Batch compute nodes. It remains to be determined whether this can scale to accommodate possibly thousands of simultaneous mounts.
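For a sense of what the Batch side could look like: Azure Batch pools already support a `mountConfiguration` in the pool definition, which mounts a share on every node under `$AZ_BATCH_NODE_MOUNTS_DIR`. A sketch only, not CoA's implementation; the NFS source, paths, and pool fields are hypothetical, and most required pool fields are omitted:

```sh
# Trimmed pool definition showing just the mountConfiguration piece.
cat > pool.json <<'EOF'
{
  "id": "coa-mounted-pool",
  "vmSize": "standard_d4s_v3",
  "mountConfiguration": [
    {
      "nfsMountConfiguration": {
        "source": "10.0.0.4:/exports/cromwell-executions",
        "relativeMountPath": "cromwell-executions",
        "mountOptions": "-o vers=4,minorversion=1,sec=sys"
      }
    }
  ]
}
EOF
# Nodes in the pool would then see the share at
# $AZ_BATCH_NODE_MOUNTS_DIR/cromwell-executions
az batch pool create --json-file pool.json
```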
Duplicate of #213
Problem:
I am exploring options to use a local file path on the storage account for task execution without needing to localize the input files. I attempted to place the input files into the `/cromwell-executions` path, which is mounted to the task VM. During execution, I noticed that the task uses a path within `/cromwell-executions`, but the download script still downloads all of my input files for the task.
Solution:
Upon checking `BatchScheduler.cs`, it appears to collect all input files, including `additionalInputFiles`, for downloading, even when a local path is available.
Describe alternatives you've considered
Please advise whether it is possible to use the "streamable" or "localization_optional" flags on the input files to avoid excessive file downloading. I have seen discussions in the TES repository, but I'm unsure if CoA currently supports these flags.
Additional context
In general, the goal is to use an Azure Storage account as a mount for the input files and avoid unnecessary file localization. I noticed that Cromwell recently added support for the Blob filesystem, but I am uncertain whether it would help resolve this issue.