SRA tools docker
NCBI now maintains a Docker image for the SRA Toolkit: ncbi/sra-tools
When deploying the image in a container VM, please use the following path for the Docker Hub registry: registry.hub.docker.com/ncbi/sra-tools
The toolkit expects to find the preconfigured settings file in ${HOME}/.ncbi/user-settings.mkfg. During docker build, the preconfigured settings file is put in /root/.ncbi/user-settings.mkfg. If you change (or your workflow engine changes) the HOME environment variable, you should move or copy /root/.ncbi/user-settings.mkfg to ${HOME}/.ncbi/user-settings.mkfg. Otherwise, you will get a message saying that the toolkit requires configuration.
For example:
% mkdir -p $HOME/.ncbi
% docker run -t --rm -v ${HOME}:${HOME}:rw -w ${HOME} --env HOME ncbi/sra-tools cp /root/.ncbi/user-settings.mkfg .ncbi
You can verify the path of the configuration file with vdb-config -o n NCBI_SETTINGS, e.g.
% docker run -t --rm -v ${HOME}:${HOME}:rw -w ${HOME} --env HOME ncbi/sra-tools vdb-config -o n NCBI_SETTINGS
NCBI_SETTINGS = "/home/user/.ncbi/user-settings.mkfg"
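The copy step described above can be made safe to repeat, for instance at the start of a workflow step that runs inside the container. The following is a minimal sketch, not part of the toolkit: the function name ensure_sra_config is hypothetical, and it assumes the preconfigured file is at /root/.ncbi/user-settings.mkfg as stated above.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with the image): copy the
# preconfigured settings file into HOME unless it is already there.
# $1: source settings file (default: the location used by docker build)
# $2: HOME directory to populate (default: $HOME)
ensure_sra_config() {
    src="${1:-/root/.ncbi/user-settings.mkfg}"
    home_dir="${2:-$HOME}"
    dest="$home_dir/.ncbi/user-settings.mkfg"

    # Nothing to do if the settings file is already in place.
    [ -f "$dest" ] && return 0

    # Fail with a clear message if the preconfigured file is missing.
    if [ ! -f "$src" ]; then
        echo "settings file not found: $src" >&2
        return 1
    fi

    mkdir -p "$home_dir/.ncbi" && cp "$src" "$dest"
}
```

Inside the container it could be called as ensure_sra_config && vdb-config -o n NCBI_SETTINGS, so that the verification only runs once the file is in place.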
When running multiple related jobs on a cluster, you can (and probably should) point every instance to the same configuration file (e.g. by using a common HOME). Additionally, you can (and probably should) create a common repository for SRA data and reference files. For example:
% mkdir -p /mnt/bigdisk/sra-data-repo
% mkdir -p /mnt/bigdisk/sratools-home/.ncbi
% docker run -it --rm -v /mnt/bigdisk/sra-data-repo:/repo:rw -v /mnt/bigdisk/sratools-home:/mnt/bigdisk/sratools-home:rw -w /mnt/bigdisk/sratools-home --env HOME=/mnt/bigdisk/sratools-home ncbi/sra-tools
# cp /root/.ncbi/user-settings.mkfg .ncbi
# vdb-config --set /repository/user/main/public/root="/repo"
# prefetch ERR036591
... snip ...
# ls -l /repo/sra
total 2165156
-rw-r--r-- 1 root root 1807665268 Sep 2 15:28 ERR036591.sra
-rw-r--r-- 1 root root 17284537 Sep 2 15:28 ERR036591.sra.vdbcache
After prefetching all the data, you can run all the processing jobs with the repository mounted read-only:
% docker run -it --rm -v /mnt/bigdisk/sra-data-repo:/repo:ro -v /mnt/bigdisk/sratools-home:/mnt/bigdisk/sratools-home:rw -w /mnt/bigdisk/sratools-home --env HOME=/mnt/bigdisk/sratools-home ncbi/sra-tools
# fastq-dump ERR036591
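For batch processing, the interactive session above can be replaced by one non-interactive container per accession. This is a rough sketch under assumptions: the variable names SRA_REPO, SRA_HOME and DOCKER and the functions run_sra_job and dump_all are hypothetical, and the paths are the ones used in the example above.

```shell
#!/bin/sh
# Paths from the example above; override via the environment if needed.
SRA_REPO="${SRA_REPO:-/mnt/bigdisk/sra-data-repo}"
SRA_HOME="${SRA_HOME:-/mnt/bigdisk/sratools-home}"
# Set DOCKER=echo to print the docker arguments instead of running them.
DOCKER="${DOCKER:-docker}"

# Run one sra-tools command in a fresh container that shares the
# common HOME (read-write) and the prefetched repository (read-only).
run_sra_job() {
    "$DOCKER" run --rm \
        -v "$SRA_REPO:/repo:ro" \
        -v "$SRA_HOME:$SRA_HOME:rw" \
        -w "$SRA_HOME" \
        --env "HOME=$SRA_HOME" \
        ncbi/sra-tools "$@"
}

# Convert every accession given as an argument; report failures and
# keep going so that one bad accession does not stop the batch.
dump_all() {
    status=0
    for acc in "$@"; do
        run_sra_job fastq-dump "$acc" || { echo "failed: $acc" >&2; status=1; }
    done
    return "$status"
}
```

With DOCKER set to echo, dump_all ERR036591 prints the docker arguments it would use, which is a cheap way to check the mounts before submitting real jobs.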
Example of running fasterq-dump:
% docker run -t --rm -v $PWD:/output:rw -w /output ncbi/sra-tools fasterq-dump -e 2 -p SRR10985476
Unable to find image 'ncbi/sra-tools:latest' locally
latest: Pulling from ncbi/sra-tools
c9b1b535fdd9: Already exists
0a6856f8fd06: Pull complete
2d9bc7db21a2: Pull complete
3de524257044: Pull complete
Digest: sha256:631578b15625cc5390928772f1bf945847ce2981a81a95042729a47579396099
Status: Downloaded newer image for ncbi/sra-tools:latest
lookup :|-------------------------------------------------- 100%
merge : 16319508
join :|-------------------------------------------------- 100%
concat :|-------------------------------------------------- 100%
spots read : 14,965,183
reads read : 14,965,183
reads written : 14,965,183
Please note these suggested options included in the examples:
- creating a host volume to write to: -v $PWD:/output:rw
- setting the container working directory to the host volume: -w /output
Most tools write to the current working directory unless told otherwise, and you probably do not want the tools to write into the container's file system. So, please set the working directory to a host volume.
Example of running prefetch followed by fasterq-dump:
% docker run -t --rm -v $PWD:/output:rw -w /output ncbi/sra-tools prefetch SRR10985476
2020-06-23T18:07:35 prefetch.2.10.8: 1) Downloading 'SRR10985476'...
2020-06-23T18:07:35 prefetch.2.10.8: Downloading via HTTPS...
2020-06-23T18:07:45 prefetch.2.10.8: HTTPS download succeed
2020-06-23T18:07:45 prefetch.2.10.8: 'SRR10985476' is valid
2020-06-23T18:07:45 prefetch.2.10.8: 1) 'SRR10985476' was downloaded successfully
2020-06-23T18:08:27 prefetch.2.10.8: 'SRR10985476' has 454 unresolved dependencies
2020-06-23T18:08:27 prefetch.2.10.8: 2) Downloading 'ncbi-acc:NC_000001.11?vdb-ctx=refseq'...
2020-06-23T18:08:27 prefetch.2.10.8: Downloading via HTTPS...
2020-06-23T18:08:33 prefetch.2.10.8: HTTPS download succeed
2020-06-23T18:08:33 prefetch.2.10.8: 2) 'ncbi-acc:NC_000001.11?vdb-ctx=refseq' was downloaded successfully
...
2020-06-23T18:10:25 prefetch.2.10.8: 455) Downloading 'ncbi-acc:NW_004504305.1?vdb-ctx=refseq'...
2020-06-23T18:10:25 prefetch.2.10.8: Downloading via HTTPS...
2020-06-23T18:10:25 prefetch.2.10.8: HTTPS download succeed
2020-06-23T18:10:25 prefetch.2.10.8: 455) 'ncbi-acc:NW_004504305.1?vdb-ctx=refseq' was downloaded successfully
% docker run -t --rm -v $PWD:/output:rw -w /output ncbi/sra-tools fasterq-dump -p SRR10985476
lookup :|-------------------------------------------------- 100%
merge : 17976103
join :|-------------------------------------------------- 100%
concat :|-------------------------------------------------- 100%
spots read : 14,965,183
reads read : 14,965,183
reads written : 14,965,183
Please note that both commands use the same host volume for the working directory. This allows fasterq-dump to find the files that prefetch retrieved.
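The two commands above can be chained so that they always share one host directory and the conversion only starts after a successful download. A minimal sketch, assuming the same docker options as in the examples; the names DOCKER, sra_in_dir and fetch_and_dump are hypothetical.

```shell
#!/bin/sh
# Set DOCKER=echo to print the docker arguments instead of running them.
DOCKER="${DOCKER:-docker}"

# Run an sra-tools command with a host directory mounted at /output
# and used as the container working directory.
# $1: host directory; remaining arguments: the tool and its options.
sra_in_dir() {
    host_dir="$1"
    shift
    "$DOCKER" run -t --rm -v "$host_dir:/output:rw" -w /output ncbi/sra-tools "$@"
}

# Download an accession, then convert it, both in the same host
# directory so that fasterq-dump finds what prefetch retrieved.
# $1: accession; $2: host directory (default: current directory).
fetch_and_dump() {
    acc="$1"
    work_dir="${2:-$PWD}"
    sra_in_dir "$work_dir" prefetch "$acc" &&
        sra_in_dir "$work_dir" fasterq-dump -p "$acc"
}
```

For example, fetch_and_dump SRR10985476 runs both steps against the current directory; if prefetch fails, fasterq-dump is not started.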
We have seen TLS errors when running on AWS, like these:
2020-06-19T15:50:53 prefetch.2.10.7: Downloading via HTTPS...
2020-06-19T15:50:53 prefetch.2.10.7 sys: mbedtls_ssl_get_verify_result returned 0x8 ( !! The certificate is not correctly signed by the trusted CA )
2020-06-19T15:50:53 prefetch.2.10.7 int: connection failed while opening file within cryptographic module - Cannot KClientHttpRequestGET: /scratch/SRR5709848/SRR5709848.sra
2020-06-19T15:50:53 prefetch.2.10.7: HTTPS download failed
The solution is to make the host's certificates visible inside the container:
docker run -v /etc/pki:/etc/pki:ro -v /etc/ssl:/etc/ssl:ro ...
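Not every host has both certificate directories, so the mount options can be generated only for those that exist. The following is a sketch under that assumption; the function name cert_mount_opts is hypothetical and is not part of Docker or the toolkit.

```shell
#!/bin/sh
# Print a "-v DIR:DIR:ro" option pair for every argument that is an
# existing directory on the host; silently skip the rest.
cert_mount_opts() {
    for dir in "$@"; do
        if [ -d "$dir" ]; then
            printf '%s %s ' -v "$dir:$dir:ro"
        fi
    done
}
```

The output is meant to be expanded unquoted on the docker run command line, e.g. docker run $(cert_mount_opts /etc/pki /etc/ssl) ..., which relies on the directory names containing no whitespace.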